A method of encoding text



This invention is a method of representing the text in a
document in a way that enables very fast text processing on
digital computers. More specifically, each word of text is
represented as a number (or token) that refers to an
information packet describing the word's characteristics.
Operations then process each token, rather than each
character, to perform text processing functions. In addition
to the compact nature of this representation, the performance
of virtually all functions in a 'what-you-see-is-what-you-get'
(WYSIWYG) editor is improved. In particular, determining line
breaks and displaying text are significantly faster when the
text is processed a token at a time rather than a character at
a time.



In a typical word processing system each para- 
graph exists internally as one or more strings of 
characters, and must be broken into lines before it 
can be displayed or printed. For example, the 
typical line-breaking algorithm has a main inner 
loop which adds the width of the current character 
to the sum of the widths of previous characters, 
and compares the new total to the desired line 
width. The program will execute this loop until the 
number of characters in the line exceeds the number
of characters that can be fitted in the line. At
this point, the program can either end the line with 
the last full word, or hyphenate the current word 
and put the word portion after the hyphen at the 
beginning of the next line.

Two problems with this process cause it to run 
too slowly: first the inner loop must be executed 
for every character in the line; second, if hyphen- 
ation is enabled, the context of the character that 
overran the margin must be deduced - that is, a
determination must be made whether the character
is a space, punctuation mark, or part of a word. In
general, all operations that require processing of
each character, such as pagination and scrolling
through the document, are very slow. In addition,
operations that depend on the interpretation of the
document as a sequence of words, such as hy- 
phenation, spell-checking and search and replace 
are also very slow. 

US-A-4,181,972 relates to a means and methods for automatic
hyphenation of words and discloses a means responsive to the
length of input words, rather than characters. However, that
invention does not store the word length obtained for future
use; at the time that hyphenation is requested, it scans the
entire text character-by-character. It also does not compute
breakpoints based on the whole word length; instead, Casey
teaches the use of a memory-based table of valid breakpoints
between consonant/vowel combinations.

US-A-4,092,729 and 4,028,677 relate to methods
of hyphenation also based on a memory table
of breakpoints. US-A-4,092,729 accomplishes hyphenation
based on word length (see claim 6) but the method
disclosed is different from the invention disclosed
here. In it, words are assembled from characters at 
the time hyphenation is requested, and then com- 
pared to a dictionary containing words with break- 
points. The invention disclosed here assembles the 
words at the time the document is encoded, and 
does not use a dictionary look-up technique while 
linebreaks are computed. 

What is required is a better method of representing the text
for document processing. A natural approach for reducing the
computational intensity of the composition function would be
to create data structures that would enable computation a word
at a time rather than a character at a time. The internal
representation of the text, in this case, is a token which is
defined as the pair:
<type, data>

where the type is a unique identifier for each class
of token, and data are the data associated with a
particular type of token. A token can be represented
in a more compact way as
<type, pointer>

where the pointer is the address of the data associated with
that token. This form of the token is more easily manipulated
since entries are the same length. An even more compact
representation of a token is achieved when the token type is
included in the data block; this reduces the fundamental token
object to a pointer. Since the type information is still
present in the data block, a pointer of this form is still
appropriately referred to as a token. In the past, several
approaches used an internal representation of text that was
some form of token, and all had drawbacks that prevented them
from being applied to rapid text composition.

Numerous known systems have used tokens for editing computer
programs. See, for example: 'Copilot: A Multiple Process
Approach to Interactive Programming Systems,' Daniel Carl
Swinehart, July 1974, PhD thesis, Stanford University.
Swinehart uses tokens to maintain a relationship between the
source code (text) and the corresponding parse tree that the
compiler uses to translate the program into machine
instructions. After each editing operation the lines of source
code that changed are rescanned into tokens, the parse tree is
rebuilt and, finally, the parse tree is inspected for
correctness. These systems are very popular for creating and
modifying programs written in languages like Lisp, but tend to
be fairly slow and laborious. The benefit to the user is that
there is a greater likelihood that the changes made to a
program will result in errors being removed rather than
introduced.

A second known approach uses tokens as the fundamental text
unit to represent English words rather than elements of a
computer programming language. In 'Lexicontext: A
Dictionary-Based Text Processing System,' John Francis
Haverty, August 1971, MS thesis, Massachusetts Institute of
Technology, a token points to a lexicon entry containing the
text for the word; a hashing function is then used to retrieve
the data associated with the entry, which can be uniquely
defined for each token. This encoding method is very general,
but at the expense of performance.

Furthermore, since a principal application of Haverty's method
is as a natural language interface to an operating system, the
lexicon is global and thus independent of any particular
document. This architecture is practical in an environment
where the information is processed on a single central
processor and when the entire universe of words that would be
encountered is known in advance. Even if words could be added
to the global lexicon, there would still be problems in a
distributed environment where processors may not be connected
to a network or other communications devices. In this case,
the lexicons would quickly diverge, and documents created on
one machine could not be correctly interpreted on any other
machine. Another major drawback of this approach is that if an
error is detected in the main lexicon, all of the documents
encoded with the flawed lexicon would need to be reprocessed -
if it were even possible to rebuild the documents. Because the
main lexicon must by design be very large, it would be
impractical to maintain the lexicon as resident in main
memory. A large lexicon not resident in main memory would
impose a tremendous performance penalty.

This invention is a method of using tokens to represent text
in a way that is specifically designed for efficiently editing
text, specifically when applied to WYSIWYG editors. Rather
than the tree-oriented structure that is used in the computer
program editors, a simple linked list is used. The tokens
point directly to the data associated with the token, thus
eliminating the hashing function and, although the data blocks
are of variable length, the data blocks are uniformly defined
for all tokens. The dictionaries are local to each document,
leading to a system that is well suited to distributed
environments.

The technique could be applied to a document 
composition system to speed up line-breaking and
other macroscopic document operations such as
pagination, string search and replace, and spelling
correction. This invention can also be used for
improving the performance of interactive operations 
such as displaying typed-in characters, scrolling 
through the document and resolving mouse clicks 
to a position in the document. The method is 
particularly efficient when hyphenated text is de- 
sired. Performance does not degrade when the 
algorithms are extended to support ligatures, ker- 
ned pairs and foreign text. This technique is ex- 
tremely well suited to German text which is more 
likely to contain long words, hyphenations, spelling 
changes which result from hyphenations, and 
words that must be hyphenated more than once. 

The method consists of parsing the text in a document into a
sequence of "atoms" which can be words, punctuation marks or
spaces, and assigning a number (a "token") to each one. As an
example, if the program assigns the token "301" to the word
"of" the first time that word is encountered, then it will
continue to assign the same number "301" to every other "of"
in the document.
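The token assignment described above can be illustrated with a minimal Python sketch. It is not the encoding of the invention: the function name `encode`, the regular expression used to split atoms, and the numbering from 1 are assumptions made for illustration, and spaces are simply dropped here rather than encoded.

```python
import re

def encode(text):
    """Split text into atoms (words, punctuation marks) and give each
    distinct atom one token number. Hypothetical sketch only."""
    atom_table = {}   # atom text -> token number (numbering is arbitrary)
    tokens = []       # the document as a sequence of token numbers
    for atom in re.findall(r"\w+|[^\w\s]", text):
        if atom not in atom_table:
            atom_table[atom] = len(atom_table) + 1
        tokens.append(atom_table[atom])
    return tokens, atom_table

tokens, atom_table = encode("the cost of the text of the page")
print(tokens)            # every "of" carries the same number
print(atom_table["of"])
```

Because the table is keyed on the exact characters, "The" and "the" receive different tokens in this sketch.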

A table of properties is also constructed for each unique
token in the document. The following is a list of the
properties maintained for atoms other than spaces:

text: the characters in the atom

lastfont: a code representing the display characteristics of
the font used to compute the token properties cached in this
table

displayBitMap: the bit map of the atom text in font lastfont

notPunctuation: a Boolean indicating whether the atom is a
punctuation mark

atomMetrics: a record containing the character count of the
token and the width of the word in screen and printer units.
This information is derived from the font referred to by
lastfont.

breakPoints: an array with one entry for each break point in
the token. If the entry is a hyphenation point, the entry
contains metric information for the portion of the word prior
to the hyphenation point, including the width of the hyphen.
If the hyphen is a hard hyphenation point inserted by the
user, then the width of the hyphen is not included.

The token corresponding to a space is handled differently from
other tokens. It does not have a set of properties associated
with it, since the rules for treating it are much different
from those of other tokens.

A text processing function can proceed by using each
successive token to access the current token properties. This
greatly speeds up the algorithms that classically process the
document a character at a time, as well as the text functions
that interpret the document as a sequence of atoms.

The line-breaking algorithm can use each successive token to
access the metric information in the token properties. If the
line width has been exceeded, the current line will usually be
terminated at the previous token. If the text is to be right
justified, the interword spacing can be stretched. Finally, if
the line cannot be stretched far enough, the overset token
will be examined to determine if it can be hyphenated.
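As a rough illustration of token-at-a-time line breaking, the following Python sketch sums cached token widths until the measure is exceeded. It is a simplification under stated assumptions: the name `break_line`, the `widths` dictionary and the fixed `space_width` are hypothetical, and justification and hyphenation are left out.

```python
def break_line(tokens, widths, space_width, margin):
    """Return the number of leading tokens that fit within `margin`.

    tokens      : sequence of token numbers
    widths      : dict mapping token number -> cached width
    space_width : width charged between adjacent tokens (assumed fixed)
    """
    total = 0
    for i, tok in enumerate(tokens):
        needed = widths[tok] + (space_width if i > 0 else 0)
        if total + needed > margin:
            return i          # the line ends at the previous token
        total += needed
    return len(tokens)        # block exhausted without a break

widths = {1: 30, 2: 40, 3: 20}
print(break_line([1, 2, 3, 1, 2], widths, space_width=10, margin=120))
```

The loop performs one addition and one comparison per token, whereas the classical loop performs them per character.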
The token-by-token method not only leads to more efficient
line breaking, but also speeds up other editing functions that
depend on the document being interpreted as a series of atoms
(e.g. words, spaces and punctuation marks). With
spell-checking, for example, no matter how many times a word
is used in a document the spelling of that word need only be
checked once, since the same token will be used for each
instance of the word. The algorithm proceeds in two phases.
First the algorithm checks all entries in the atom table. Then
it scans the document for contiguous or fragmented words.
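The first phase of such a spell-check can be sketched in Python as follows. The names `misspelled_tokens` and `misspelled_positions` and the set-based dictionary are assumptions for illustration; the second phase (contiguous or fragmented words) is not modelled.

```python
def misspelled_tokens(atom_text, dictionary):
    """Check each distinct atom once. atom_text maps token -> text."""
    return {tok for tok, text in atom_text.items()
            if text.lower() not in dictionary}

def misspelled_positions(tokens, bad_tokens):
    """Locate every occurrence by comparing token numbers only."""
    return [i for i, tok in enumerate(tokens) if tok in bad_tokens]

atom_text = {1: "the", 2: "documnet", 3: "is"}
tokens = [1, 2, 3, 1, 2]
bad = misspelled_tokens(atom_text, {"the", "is", "document"})
print(bad, misspelled_positions(tokens, bad))
```

The dictionary is consulted three times here even though the document contains five atoms; the saving grows with the amount of repetition in the text.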

Inserting and deleting characters as well as entire tokens
from text encoded with this method is very efficient. In the
case of individual characters, a special token is employed
that can be quickly modified. Inserting and deleting entire
tokens is even faster than individual characters since the
operations involve only modifying the string of tokens. Screen
painting during type-in is very rapid, since the operations
typically involved (updating the document data structures,
determining the new line endings, and painting the text on the
screen) all benefit from this technique.

The search and replace function also benefits
from having to process only the text in the atom
table, if it is searching for one word. If it is searching
for multiple words it need only scan the document
ment for sequences of atoms, rather than se- 
quences of characters. 
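A minimal Python sketch of searching over tokens follows. The helper names `find_word` and `find_phrase` and the dictionary form of the atom table are hypothetical; the point shown is only that the character comparison happens once, in the atom table, and the document scan compares integers.

```python
def find_word(tokens, atom_table, word):
    """atom_table maps atom text -> token number."""
    target = atom_table.get(word)
    if target is None:
        return []                      # word never occurs in the document
    return [i for i, tok in enumerate(tokens) if tok == target]

def find_phrase(tokens, atom_table, words):
    """Find a sequence of atoms as a sequence of token numbers."""
    if not words or any(w not in atom_table for w in words):
        return []
    pattern = [atom_table[w] for w in words]
    n = len(pattern)
    return [i for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == pattern]

atom_table = {"the": 1, "cat": 2, "sat": 3}
tokens = [1, 2, 3, 1, 2]
print(find_word(tokens, atom_table, "the"))
print(find_phrase(tokens, atom_table, ["the", "cat"]))
```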

This method of encoding text also leads to a very compact
external format. It is possible to segregate the token
properties into basic properties and derived properties that
can be computed from the basic token properties (i.e. the
characters in the token and the location of break points).
Only the basic token properties need to be written out on the
file. When a new edit session is started with the file, the
basic properties are used to add the derived properties.

The present invention will now be described, by way of
example, with reference to the accompanying tables and
drawings, in which:

TABLE 1 is the source file defining the token data structure,
written in the Mesa programming language;

TABLE 2 is the encoding of the text in TABLE 4 using the data
structure defined in TABLE 1;

TABLE 3 is the source file defining the data structures to
represent token properties, written in the Mesa programming
language;

TABLE 4 is a sample text passage used in the examples;

FIGURE 1 is the memory layout for the token properties defined
in TABLE 3 for a token that has no break points;

FIGURE 2 is the memory layout of the token properties defined
in TABLE 3 for a token that has two break points;

FIGURE 3 is a diagram of the parameters defining the display
bitmap portion of the data structures in TABLE 3;

FIGURE 4 is a diagram showing the order in which tokens are
processed when hyphenation is enabled, using text from
TABLE 4;

TABLE 5 is the source file for the definitions of the
line-breaking algorithm that processes text encoded as tokens,
written in the Mesa programming language;

TABLE 6 is the source file for the implementation of the
line-breaking algorithm for breaking lines of text encoded as
tokens, written in the Mesa programming language;

TABLE 7 is the equivalent of the Mesa source code in TABLE 6
written in the C programming language;

TABLE 8 is the result of the algorithm defined in TABLE 6 on
the first three lines of the text in TABLE 4;

FIGURE 5 is the text in TABLE 4, with a portion of the text
highlighted to represent a selection;

TABLE 9 is the initial fragment of the encoded text in TABLE 2
that remains after the text selected in FIGURE 5 is deleted,
and

TABLE 10 is the final fragment of the encoded text in TABLE 2
that remains after the text selected in FIGURE 5 is deleted.

Encoding text using the method of this invention consists of
parsing the document into atoms and building arrays of tokens
that correspond to the atoms. A small number of entries in the
arrays are not tokens. These are special entries that are
required for encoding infrequent sequences of characters (like
consecutive spaces) and for encoding very large documents.

The text in Table 1 consists of the type definitions for the
data structures needed to encode text into tokens. The
computer language used in Table 1 is Mesa. Mesa is similar to
Pascal and Modula II. The directory clause declares that the
type Offset from the interface Token is used in
LineBreak.mesa. Next, the file is declared as a DEFINITIONS
file since the function of the file is to define data-types
and procedures.

The data structure defining the encoded array
of tokens is EncodedText. Each element in the
array is an Entry. Each Entry in the encoded text
fits in one word of memory. The Entry is a record
that has two variants: a token or an escape.

The token variant consists of two fields: the number
corresponding to the atom the token refers to, and a Boolean
term indicating whether or not a space follows the token. To
maximize the performance, the token assigned to each atom is
chosen in such a way as to allow it to be used to determine
the location in memory of the properties for that token.

The escape variant of the Entry record is itself
a variant record. This variant is used to encode
information that cannot be represented with token
entries.

The changeBase escape variant is required to encode large
documents. Since the offset in the token variant consists of
only 14 bits, the number of tokens that can be directly
addressed is limited. The changeBase escape allows the address
space for token properties to be changed, and thus allows a
very large number of token properties to be addressed.
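To make the one-word Entry concrete, here is a hedged Python sketch of packing a token entry into 16 bits. Only the 14-bit offset and the Boolean are taken from the description; the bit positions chosen below (bit 14 for the space flag, bit 15 to mark an escape) and the names `pack_token` and `unpack_token` are assumptions, not the layout of Table 1.

```python
OFFSET_BITS = 14
OFFSET_MASK = (1 << OFFSET_BITS) - 1      # 0x3FFF, largest direct offset
SPACE_FLAG = 1 << OFFSET_BITS             # assumed position of spaceFollows
ESCAPE_FLAG = 1 << (OFFSET_BITS + 1)      # assumed variant tag (unused here)

def pack_token(offset, space_follows):
    """Pack a token entry into one 16-bit word."""
    if not 0 <= offset <= OFFSET_MASK:
        # Offsets beyond 14 bits are the case a changeBase escape exists for.
        raise ValueError("offset does not fit in 14 bits")
    return (SPACE_FLAG if space_follows else 0) | offset

def unpack_token(entry):
    """Return (offset, space_follows) for a packed token entry."""
    return entry & OFFSET_MASK, bool(entry & SPACE_FLAG)

print(hex(pack_token(1, True)))
print(unpack_token(pack_token(323, False)))
```

With 14 bits, at most 16384 distinct offsets are reachable from a single base, which is why a change of base is needed for large documents.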

The space escape variant is used to represent 
consecutive spaces. This is needed because the 
token entry can only encode whether a single 
space is following an atom. The space variant has 
no parameters. 

The zeroWidthSpace escape variant is used to 
represent a non-hyphenating break point in an 
atom. This is a seldom-used feature in the XICS 
markup language. The zeroWidthSpace variant has 
no parameters. 

Table 2 is the list of tokens that would be created from the
text of Table 4. The table consists of the contents of each
entry of the encoded text, one entry per row of the table. For
the sake of clarity, the text for each token is included
immediately to the right of each token Entry in Table 2. The
first row of Table 2 contains an escape Entry that is a
changeBase variant. It sets the address space for the token
properties to the first address space. The address of the
token properties is computed by combining the base address
defined in the changeBase entry and the offsets in the token
entries. The second row of the table contains a token variant
entry. There are two items for each atom: a number identifying
the atom, and a bit indicating whether there is a space after
the atom. In this coding scheme, the space between atoms is
part of the preceding atom. For example, the first word "The"
is assigned the values
[spaceFollows: TRUE, offset: 1].
Similarly, the second word, "approach", is
[spaceFollows: TRUE, offset: 20].

The ninth entry is the word "the" again, but 
with a lower case "t". This atom cannot be given
the same token as the original "The", which had a
capital "T", because the widths of the characters
will not usually be the same. The 17th entry is a 
left parenthesis. It is coded as 
[spaceFollows: FALSE, offset: 323]. 
The 29th item on the list, the word "or" is the 
same word as the 18th word on the list. They will 
have identical properties and will therefore use the 
same token. Likewise, the commas of entries 26 
and 28 will have the same tokens. 
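The role of the spaceFollows bit can be seen by decoding a few entries back into characters. The Python sketch below is illustrative only: entries are modelled as `(offset, space_follows)` tuples, the string `"space"` stands in for the space escape, and `atom_text` is a hypothetical lookup from offset to atom text.

```python
def decode(entries, atom_text):
    """Rebuild the character text from a list of entries."""
    out = []
    for entry in entries:
        if entry == "space":           # stand-in for the space escape
            out.append(" ")
            continue
        offset, space_follows = entry
        out.append(atom_text[offset])
        if space_follows:              # the space belongs to the preceding atom
            out.append(" ")
    return "".join(out)

atom_text = {1: "The", 20: "approach", 323: "("}
print(repr(decode([(1, True), (20, True), (323, False)], atom_text)))
print(repr(decode([(1, True), "space", (20, False)], atom_text)))
```

The second call shows two consecutive spaces: one carried by the token entry and one by the escape.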

Each token has an associated list of properties, as shown in
Table 3. The set of properties is made up of two records: the
TokenProps.Object and an instance of the record
TokenProps.Trailer. Both of these records are declared as
MACHINE DEPENDENT to force the Mesa compiler to pack the
fields into as few words of memory as possible. By ordering
the values in terms of decreasing frequency of access, the
number of memory references needed to access the token
properties could be minimized. Since in the Mesa programming
language indeterminate length arrays can be located only at
the end of a record, two records were required to achieve the
optimal order.

The first field in the Object record is a Boolean term which
indicates whether the atom is a punctuation mark. This field
is used during line-breaking computations to determine where
legal breaks can occur.

Next is a number identifying the style, size, stress and
weight of the font. This number represents the last font from
which the values of the atomMetrics, breakpoint array (if
present) and the display bitmap were computed. Thus if the
current font in which the atom is being processed is the same
as the last font in which the atom was processed, the values
in the property records are simply accessed and not
recomputed, since the values are still correct. Otherwise, the
values in the property records must be recomputed prior to
processing the current token.
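The font check described above amounts to a one-entry cache per token. A small Python sketch under stated assumptions (the dictionary keys, the function name `width_for` and the `measure` callback are hypothetical, and only a single width is cached rather than the full set of properties):

```python
def width_for(props, font, measure):
    """Return the cached width, recomputing only when the font changed.

    props   : dict with keys "text", "lastfont", "width"
    measure : callable (text, font) -> width, assumed to be expensive
    """
    if props["lastfont"] != font:
        props["width"] = measure(props["text"], font)
        props["lastfont"] = font
    return props["width"]

calls = []
def measure(text, font):
    calls.append((text, font))
    return len(text) * font        # toy metric for demonstration

props = {"text": "the", "lastfont": None, "width": 0}
width_for(props, 10, measure)
width_for(props, 10, measure)      # same font: cached value is reused
print(len(calls))
```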

The third field is called the AtomMetrics and is also defined
in the TokenProps interface. This record contains the metric
information for the entire atom. The values in the AtomMetrics
record are the length of the atom in micas (a
machine-independent unit) and pixels (screen units) and the
length of the atom text in bytes. English text would typically
require one byte for each character, but more bytes per
character may be required in another language or for
representing special technical symbols. See "The Xerox
Character Code Standard" for a method of encoding
international characters and special symbols that would
require more than one byte per character.

Following the atomMetrics is the breakPointCount, which
corresponds to the number of places an atom can be broken
between lines. Break points are usually determined by
hyphenating a word. A word may also include manually-inserted
zero-width spaces. In this encoding technique, the atom "the"
has no break points. The last field in the Object is the
breakpoint array. This array may be omitted altogether if
there are no break points in the atom. Figure 1 shows the
memory layout of the properties for the atom "the". If there
are break points in the atom, each element of the breakpoint
array will consist of the first parts of the divided words
that can be formed by hyphenating the original word. For
example, the three-syllable word "document" will have two
break points: information to describe "doc-" and "docu-". This
is shown in Figure 2.

The first element in each breakpoint array entry consists of
the break point type, which is a parameter representing how
desirable the break point is. This is used by the
line-breaking algorithm. The byte count is the byte count of
each alternative. Length is the length of each alternative
in machine-independent units and in pixels.
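How a breakpoint array might be consulted when a word oversets the line can be sketched in Python. This is an assumption-laden illustration: entries are reduced to `(prefix, width)` pairs with the hyphen width already included, the break point type (desirability) is ignored, and `best_break` is a hypothetical name.

```python
def best_break(break_points, remaining):
    """Pick the longest hyphenated prefix that fits in `remaining`.

    break_points : list of (prefix_text, width) in order of increasing
                   length, each width including the hyphen
    Returns the chosen prefix, or None if no prefix fits.
    """
    best = None
    for prefix, width in break_points:
        if width <= remaining:
            best = prefix
    return best

document_breaks = [("doc-", 40), ("docu-", 50)]
print(best_break(document_breaks, 45))
print(best_break(document_breaks, 30))
```

Since the widths are precomputed per break point, no character widths are summed at line-breaking time.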

The second record describing the properties is
the objectTrailer record. The trailer record is al- 
ways present in the token properties. 

The first field in the objectTrailer is the referenceCount.
This is used by the editor's dynamic memory manager to
determine when to free the space for a set of properties.
Typically this is done when the reference count reaches zero.

The next field is the raster. The raster is a record called
RasterData and is the information that describes the cached
screen resolution image of the atom. The first item, bpl, is
the total pixels in each horizontal scan line. If the bit map,
for example, is divided into sixteen-bit words, this number
will typically be a multiple of sixteen bits even if the image
of the atom occupies a smaller "width", as for the individual
letter "i". The actual width of the image in pixels is stored
in the atomMetrics. The second item is the total number of
words of memory that are required to store the entire bit map
for this atom. For example, with sixteen-bit words, if the
width in pixels of an atom is thirty bits and an image is 14
pixels high, then bpl is 32 and the bitmap will require 28
words. This number is derived as follows:
height × bpl / bits-per-word

where the bpl is defined, by convention, to be the smallest
multiple of the bits in a word of memory that is greater than
or equal to the width of the image. The height is the height
of the image. The baseline is the number of scanlines down
from the top of the image to the scanline that the text will
appear to sit on. Bits is a two-word pointer to the first word
of the image. See Figure 3 for a graphic representation of the
fields in the RasterData record.
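The bitmap size arithmetic can be checked with a few lines of Python. The function name `bitmap_words` and the default of sixteen bits per word are assumptions for this sketch; the rounding rule for bpl follows the convention stated above.

```python
def bitmap_words(width_px, height_px, bits_per_word=16):
    """Return (bpl, words) for a one-bit-per-pixel image.

    bpl is the width rounded up to a whole number of memory words;
    words is height * bpl / bits_per_word.
    """
    words_per_line = -(-width_px // bits_per_word)   # ceiling division
    bpl = words_per_line * bits_per_word
    return bpl, height_px * bpl // bits_per_word

print(bitmap_words(30, 14))
```

For the example in the text (30 pixels wide, 14 pixels high), bpl rounds up to 32, giving two words per scan line and 28 words in total.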

The final field of the Trailer record is called 
text. This is an array containing the characters in
the token. The array length is stored in the Atom- 
Metrics, defined above. To optimize searching for 
tokens, the number of bytes allocated for the text 
field may actually exceed the number of characters 
in the token. The number of characters allocated 
will usually be a multiple of two or four, depending 
on the particular machine the algorithm is imple- 
mented on. 

Table 5 is the Mesa source code which defines 
the data and procedures for computing line breaks. 
The directory portion of the file defines the ele- 
ments of other interface files that are referenced in 
LineBreak.mesa. LineBreak.mesa is declared as a
DEFINITIONS file since the function of the file is to 
define data-types and procedures. 

SuffixChar is a type that defines possible final characters
that can appear at the end of a line of text. This is required
for the display algorithm. SuffixChar is declared as
machine-dependent to ensure that the values assigned to each
element in the enumerated type are consecutive and begin
with 0.

Reason is an enumerated type that lists the possible reasons
that the line-breaking algorithm can return. Reason is also
machine-dependent since the line-breaking algorithm depends on
the particular values the compiler assigns to each element in
the enumerated type. The following table defines each of the
values of Reason:

margin
the current token has exhausted the line measure and a
line-breaking decision has been made. This means that a line
break was identified that satisfied all of the constraints
placed on the algorithm.

normal
the current block of text has been exhausted with no
line-breaking decision being made.

changeBase
the current token is a changeBase escape.

invalidProps
the properties for the current token are out of date and need
to be recomputed with the current font.

contiguousWords
the current token is not preceded by a space or punctuation
mark. This usually implies that the two tokens are fragments
of a single word. This result enables the client code to
adjust the metric information for kerning and letterspacing,
as well as to keep track of the beginning of the fragmented
token in case a sequence of token fragments needs to be
hyphenated.

unableToBreak
a line-breaking decision could not be made even though the
current line measure has been reached. The most common event
that causes this result is that the token that overruns the
margin is a punctuation mark.

specialGermanCase
this reason is returned when the line-breaking algorithm
attempts to break a token that requires respelling at the
desired hyphenation point.

TwelveBits is a type that defines a twelve-bit 
unsigned integer. It is used in the stateRec record
that is described below. 

ArgRec is a machine-dependent record which 
is the argument to the line-breaking algorithm. It is
machine-dependent because, where possible, sev- 
eral fields in the record are packed into a single 
word of memory. 

The first field in the record is called text, which is a
descriptor representing the array of tokens to be processed.
The descriptor occupies three words of memory, with the first
two words consisting of a pointer to the first token and the
final word defining the length of the array.

The field propsBase is the base address used to resolve the
relative pointers embedded in the token entries of the
encodedText array. See Table 2 for an example of an encoded
text passage. After the propsBase is a Boolean term called
hyphenate. If 'hyphenate' is TRUE then the line-breaking
algorithm will attempt to hyphenate the token that crosses the
line measure; otherwise the algorithm backs up to the last
complete token that fits in the measure. The next field
represents the style, size, stress and weight of the font
being processed. If the font field in the ArgRec does not
match one of the font fields in the TokenProps, then the
algorithm returns with a reason of invalidProps. The field
after font is the margin, which is the line measure in micas.

The next two fields in the ArgRec are the hyphenPixelLength
and the minSpacePixelLength. These, respectively, define the
width of a hyphen and the minimum width a space can be in
screen units (pixels). The next two fields are the width of a
hyphen and the minimum width of a space in micas.

The whiteSpace field defines the maximum size of a space in
micas. It is not necessary to define also the maximum size of
a space in pixels, because only the mica measure is used in
making a line-breaking decision.

The final two fields in the ArgRec are called 'final' and
'prior'. Both of these are instances of the LineBreak.State
record. These fields will be referred to in Mesa notation as
'arg.final' and 'arg.prior', respectively. The values in these
records are used by the line-breaking algorithm for temporary
storage and to record the result when a line-breaking decision
is made. The values in arg.prior represent the last break
point passed in the current block of text. Similarly,
arg.final contains the data for the final token that was
processed before the current exit of the line-breaking
algorithm. If a line-breaking decision has been made, then the
values in arg.prior contain the values for the end of the
current line, and the values in arg.final are those to begin
the next line.

The first field in the LineBreak.State record is the index.
This is the index of the current token relative to the
beginning of the encodedText. The micaLength and the
pixelLength are the cumulative widths in micas and pixels,
respectively, of the tokens allocated to the current line.
Note that these are specifically not the cumulative widths of
the tokens in the current text block. The next field is the
count of the blanks encountered on the current line. After the
blank count is a Boolean term called 'notPunctuation'. This
field indicates whether the last token was a punctuation mark
or not. This field is used to determine the locations of legal
break points in a line. The suffixChar is a code representing
the last character on a line after a line-breaking decision is
made. The possible values of suffixChar were previously
defined in the enumerated type SuffixChar. The byteCount field
is the total number of bytes in the tokens that have been
allocated to the current line. The final field, whiteSpace, is
the maximum amount of white space that the line-breaking
algorithm can allocate to the line when making a line-breaking
decision.

ArgHandle, Argspace, and argAlignment are three types
that define data structures needed to align the ArgRec in
memory in such a way as to avoid crossing a page
boundary. This invariant is an optimization used by the
micro-code implementation on the Xerox 6085 workstation.
Since ArgRec is 23 words long, the record must start on a
32-word boundary to guarantee that it does not cross a
page boundary. Argspace defines a block of memory 55
words long, which guarantees that there is a 32-word
boundary in the range, with sufficient space after the
boundary to contain the entire record.
LineBreak.AlignArgRec is a procedure that takes as an
argument a pointer to one of the 55-word blocks of
memory, and returns a pointer to the 32-word boundary
that is contained in the block.
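The arithmetic behind the 55-word block can be checked
with a short sketch. It works on word addresses expressed
as plain integers rather than real pointers, and the
function name is hypothetical. In the worst case the block
starts one word past a boundary, so the next boundary is
31 words in, and 31 + 23 = 54 words still fit in 55.

```c
#include <assert.h>
#include <stddef.h>

enum { ARG_REC_WORDS = 23, ALIGN_WORDS = 32, ARG_SPACE_WORDS = 55 };

/* Given the word address at which a 55-word block starts, return the
   word address of the first 32-word boundary at or after it. */
size_t align_arg_rec(size_t block_start) {
    size_t aligned =
        (block_start + ALIGN_WORDS - 1) / ALIGN_WORDS * ALIGN_WORDS;
    /* The 23-word record must lie entirely inside the block. */
    assert(aligned + ARG_REC_WORDS <= block_start + ARG_SPACE_WORDS);
    return aligned;
}
```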

LineBreak.SoftwareLineBreak and LineBreak.LineBreak
define the Mesa and micro-code versions of the
line-breaking algorithm that will be defined in the next
section. The two procedures return the same results, the
difference being that the latter is implemented as a
custom machine instruction on the 6085. The argument to
both of these procedures is a pointer to the ArgRec, and
both return a LineBreak.Reason.

Table 6 is the contents of the file LineBreakImpl.mesa.
This file supplies the implementation of the interface
LineBreak. The list of names in the DIRECTORY section of
the file declares the names of interface files that are
referenced in LineBreakImpl. In the next source
statement, LineBreakImpl is declared as a program (rather
than a definitions file as before) which uses (IMPORTS) a
subprogram from the interface Frame, and shares (EXPORTS)
a program via the interface LineBreak. A constant
nullData is declared immediately after the PROGRAM
statement.


The Linebreaking Algorithm



SoftwareLineBreak computes line endings based on the
margin stored in the argument arg. The program is
designed to handle cases that require more than one block
of text in the same line. The design is also independent
of the method that the application using the program uses
to store the text. The main procedure is organized to
include four nested subprograms.

WholeWordBreak is the logic that is used when a
line-ending decision is made at the end of a whole word
or at a space. WholeWordBreak initializes data in
arg.final to refer to the beginning of the token after
the one that ends the current line. This procedure is
declared as an INLINE procedure to indicate that it is to
be treated as a macro, which implies better performance
for short procedures.

SaveThisBreak is a procedure that transfers the contents
of arg.final into arg.prior. The logic may need to revert
to the values in arg.prior if the next token cannot be
added to the current line. This procedure is also
declared to be INLINE for the sake of performance.

ProcessSpace checks to see if the current space will fit
on the line. If not, then processing is terminated and
the routine returns a TRUE value. If the current space
fits, then the data in arg.final is modified to reflect
the values at the current point. In this case, the
routine returns a value of FALSE. ProcessSpace is also
treated as a macro to optimize performance.

HyphenateWord is the subprocedure that is executed when
the margin is crossed by a token. The algorithm branches
to the unableToBreak exit clause if the last token that
was processed was not a punctuation character. This exit
results in the reason being set to unableToBreak. In this
context a space is considered as a punctuation character.
Similarly, if there are no break points to choose from,
or if hyphenation is not selected for the current block
of text, then the algorithm exits through a branch to the
noneFitsExit clause. If this path is executed, then the
last token processed is selected as the break point and
the values of final are initialized with the procedure
WholeWordBreak.

At this point the algorithm has determined that
hyphenation points exist and that it is appropriate to
try to select one of the break points. Before entering
the main loop two variables are initialized. The first is
a pointer to an element of the break-point array. This
variable is initialized to point to the first element of
the break-point array in the token being broken. The
second variable is the minimum width that the text,
including the portion of the final token, must be to
satisfy the white space constraint.

The main loop in HyphenateWord selects the best possible
break point from the break points in the list. The
algorithm requires that the break points be sorted on two
keys: first in descending order of priority (in other
words the most-desirable breaks come first) and, within
each class of priority, in decreasing order of position
in the token. This ordering allows the algorithm to
detect the optimal break point with a single pass over
the break points. The optimal break point is the one with
the highest priority and that results in the tightest fit
for the line.

Three exceptions cause the algorithm not to 
select the break point based on the highest priority 
and tightest-fit criteria. The first exception is for a 
special case break point such as a German word 
that would require respelling if the break point were 
selected. The second exception is if there is a 
manually-inserted or "hard" hyphen. In that case 
the manually-inserted hyphen is chosen whether or 
not it gives the best fit. The third exception arises
when all of the break points are too large, that is, when
none results in at least the minimum white space. In that
case the algorithm terminates without selecting a break
point and the boundary of the previous token is used for
the termination of the line.
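The single-pass selection can be sketched as follows. This
is a simplified illustration, not the Mesa code of Table
6: the structure, the function name and the fit test
against the margin are assumptions, and the special-case
and hard-hyphen exceptions are left out. Because the list
is sorted by descending priority and then by descending
position, the first break point that fits is also the
highest-priority, tightest-fitting one.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int priority;     /* larger means more desirable (assumed scale)   */
    int pixelLength;  /* width of the token prefix ending at the break */
} BreakPoint;

/* points[] must be sorted by descending priority, then by descending
   position within each priority. Returns the index of the chosen
   break point, or -1 if no break point fits on the line. */
int choose_break(const BreakPoint *points, size_t count,
                 int lineWidth, int hyphenWidth, int margin) {
    assert(points != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (lineWidth + points[i].pixelLength + hyphenWidth <= margin)
            return (int)i;  /* first fit wins because of the ordering */
    }
    return -1;              /* caller falls back to the previous token */
}
```

A return value of -1 corresponds to the third exception:
no break point is usable and the line ends at the previous
token boundary.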

If a suitable break point is selected, HyphenateWord
exits through the successExit clause. The first four
statements of the clause update arg.final and arg.prior
fields, respectively, to reflect the break point that has
been selected. Next, arg.final is updated to represent
the part of the token beginning the next line. The manner
in which arg.final is updated depends on the type of the
selected hyphenation point. If the break point is a
synthetic hyphen generated by the break-point logic, the
break-point metrics must be adjusted since the hyphen is
not actually part of the token. The most common hyphens
that are not synthetic are hard hyphens.

When any form of hyphen is chosen to terminate a line,
micaLength, pixelLength, and byteCount are all assigned
negative numbers representing the portion of the token
ending the previous line. The value of index remains
unchanged, so that the same token begins the next line.
When the negative values in arg.final are added to the
values for the entire token, the results are the proper
values for the portion of the token beginning the new
line.
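A small worked example may make the negative carryover
concrete. The structure, the function names and the
numbers used with them are illustrative assumptions; only
the add-a-negative-value idea comes from the text.

```c
#include <assert.h>

typedef struct { int micaLength, pixelLength, byteCount; } Metrics;

/* Carryover stored in arg.final after a hyphen break: the negated
   metrics of the token portion that stayed on the previous line. */
Metrics carryover(Metrics head) {
    assert(head.byteCount >= 0);
    Metrics c = { -head.micaLength, -head.pixelLength, -head.byteCount };
    return c;
}

/* Add a whole token to the running totals. When the totals start
   from a carryover, the sum is the tail of the split token. */
Metrics add_token(Metrics total, Metrics token) {
    Metrics r = { total.micaLength  + token.micaLength,
                  total.pixelLength + token.pixelLength,
                  total.byteCount   + token.byteCount };
    return r;
}
```

With a hypothetical eight-byte token of 800 micas and 64
pixels, split after a four-byte head of 420 micas and 34
pixels, adding the whole token to the carryover leaves 380
micas, 30 pixels and 4 bytes for the new line.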

The main part of SoftwareLineBreak begins with the
initialization of pointers to arg.final and arg.prior.
The main loop is executed as long as arg.final.index is
less than or equal to the length of the current text
block, arg.text. The processing that is done on each
token entry in arg.text depends on the type of the token
entry.

If the type is a token entry, then the first statement in
the token branch of the variant select statement removes
references from several of the levels of indirection to
speed processing. Next, the font that is desired for this
text block is compared to the font that was last used to
compute the atom metrics. If the font is incorrect then
the main loop exits through the invalidProps clause,
which in turn causes a return from the procedure with a
reason of invalidProps. This gives the caller an
opportunity to recompute the token properties and restart
the procedure. If the properties are still valid then the
computation is made to determine whether the entire token
fits on the current line. If it does not, then the loop
exits through the marginExit clause, and hyphenation is
attempted. If the token fits, then a check is made as to
whether the algorithm has encountered two consecutive
tokens without either a space or a punctuation mark. If
so, then the loop is exited through the contiguousExit
clause.

At this point it has been determined that the current
token fits on the line and that it can be added to the
line. The next four lines of the main loop update
arg.final to reflect this. The final statement in the
clause checks whether the spaceFollows Boolean term is
TRUE for this token. If it is, then the ProcessSpace
subprocedure is executed. If the space does not fit on
the line, then the main loop is exited through the
simpleMarginExit clause, since it is clear that
hyphenation is not needed. If the space fits, then the
token branch of the select statement is exited,
final.index is incremented, and the next entry in the
text block is processed.

The other branch of the variant select statement in the
main loop is executed if the text entry is an escape
variant of the Token.Entry record. As was described in
Table 1, the escape variant is itself a variant record
(referred to as an anonymous record in Mesa). Therefore,
the escape arm of the select statement is also a select
statement. As mentioned previously, the three variants of
this record are space, zeroWidthSpace, and changeBase.

Processing for the space variant consists of executing
the ProcessSpace subroutine. If the space does not fit on
the line then the main loop is exited through the
simpleMarginExit clause. When a zeroWidthSpace is
encountered the final.suffixChar is updated to reflect
this. The current position on the line is made a viable
break point in case the next token does not fit. A
changeBase escape causes an exit from the main loop
through the changeBase exit clause. No other processing
is needed, since the caller updates all of the necessary
data structures including arg.propsBase. If the
processing on the escape entry is completed without an
exit being executed, then the clause is exited,
final.index is incremented and the processing of the next
text entry is begun.

If the entire text block is processed without 
making a line-ending decision, then the procedure 
exits through the FINISHED exit clause of the main 
loop. 
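The overall shape of the token-at-a-time loop can be
sketched in C. This is a heavily reduced illustration
under stated assumptions: it handles only token entries
with a spaceFollows flag, works in pixels only, and omits
the escape variants, font validation, the contiguous-token
check and hyphenation. All type and function names are
hypothetical, and the three reasons stand in for the
marginExit, simpleMarginExit and FINISHED exits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { REASON_MARGIN, REASON_SIMPLE_MARGIN, REASON_FINISHED } Reason;

typedef struct {
    int  pixelLength;   /* width of the whole token          */
    bool spaceFollows;  /* true if a blank follows the token */
} Token;

typedef struct {
    size_t index;        /* entry currently being processed   */
    int    pixelLength;  /* width allocated to the line so far */
    int    blankCount;   /* blanks allocated to the line so far */
} LineState;

Reason break_line(const Token *text, size_t length, int margin,
                  int spaceWidth, LineState *state) {
    assert(text != NULL && state != NULL);
    while (state->index < length) {
        const Token *t = &text[state->index];
        if (state->pixelLength + t->pixelLength > margin)
            return REASON_MARGIN;         /* token overruns the margin */
        state->pixelLength += t->pixelLength;
        if (t->spaceFollows) {
            if (state->pixelLength + spaceWidth > margin)
                return REASON_SIMPLE_MARGIN;  /* only the space overruns */
            state->pixelLength += spaceWidth;
            state->blankCount++;
        }
        state->index++;
    }
    return REASON_FINISHED;               /* block ended, no decision */
}
```

The inner loop runs once per token rather than once per
character, which is the source of the speed-up claimed for
the encoding.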

Table 8 shows the values in the argRec prior to beginning
processing of the text in Table 4, and then after each of
the first three lines that were computed. In the first
instance the value for text is the descriptor that
represents the encoding in Table 2. The values have been
deleted for the sake of brevity. The values for font,
margin, hyphenPixelLength, minSpacePixelLength,
hyphenMicaLength, minSpaceMicaLength and whiteSpace are
all set for a 10-point serif font. The values of final
and prior are all set to initial null values since no
tokens have been processed.

The second value of arg in Table 8 corresponds to the
values in arg after the first line has been computed.
Only the values in final and prior have changed. The
final.index and the prior.index are the same, since the
last token on the first line was hyphenated. Notice that
in Table 2 the tenth entry (starting with the first
element as number zero) corresponds to the token
"document". The values in arg.prior correspond to the
accumulation of the first ten tokens plus the portion of
the eleventh token through "docu". The negative numbers
in arg.final correspond only to the portion of the last
token included in the values of arg.prior. When these
values are added to the first token of the next line
(which will be the values for the entire word "document")
the result is that the proper part of the word, "ment",
is added to the line values. This is shown graphically in
Figure 4. By processing the same token twice when the
line-breaking decision requires splitting a token across
two lines, the code that calls SoftwareLineBreak can be
written with logic that is largely independent of whether
hyphenation has occurred. This invariance results in very
efficient pagination code.

After the second line is computed, the values in arg will
be those in the third value of arg in Table 8. Notice
that in this block the value of margin changed to reflect
that the second line was not to be indented. Since the
second line is not hyphenated, the values for
arg.final.index and arg.prior.index are different. This
is appropriate since the second line ends on entry 21,
which is the token "each", but the third line begins with
entry 22, "unique". Since the final token of the second
line is not hyphenated, there are no carryover values
from line to line as there are with the first line.
Therefore, all of the other values of arg.final at this
point are null.

The final value of arg in Table 8 corresponds to 
the third line of Table 4. This line started with a 
complete token and ended with a complete token. 

Table 7 is the equivalent of the program in Table 1,
Table 3, Table 5 and Table 6, written in the C
programming language. This program will run on any
computer that provides a C compiler, including any member
of the Sun Microsystems 3 and 4 family of minicomputers.
This code has also been run on the DEC VAX 7xx family and
MicroVAX computers.

The Display Algorithm



There are two main differences between the line-breaking
and display algorithms. First, rather than accumulating
the token metrics to determine where a line ending is,
this information must have been previously computed and
passed into the display algorithm. The information
concerning the number of tokens (or parts of tokens) and
the cumulative metric information is required at the
beginning of the display process. This information is
required so that if justification is desired, the
appropriate adjustments can be made to the placement of
the tokens. Similarly, when certain text-formatting modes
are desired, the display algorithm must have available to
it the width of the tokens to be painted before the first
token is displayed. Examples are 'ragged left' mode and
modes where the text is centered on a column, page or
arbitrary measure. The second difference is that the
display algorithm accesses the RasterData in the
Tokenprops.TrailerObject (Table 3) to accumulate a
display bitmap of a line of text. This is done using the
BitBlt machine instruction on the Xerox 6085 Workstation,
which is similar to raster processing support on most
machines. The description of the bitmap that this
instruction uses for the source bitmap includes both a
width and a bit offset into the bitmap. These parameters
allow the painting of only a portion of a token bitmap
when a token is broken between two lines, usually by
hyphenation. Consider an example where the width of a
token bitmap is 32 bits, and the token is hyphenated at a
break point 17 bits into the image. To paint the portion
of the image prior to the break point, the width would be
set to 17 and the offset would be 0. To paint the portion
after the break point, the width would be 32 minus 17 or
15 bits, and the offset would be 17 bits. See line 2 of
Table 4 and Figure 4 for an example of when this is
important.
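The width and offset arithmetic for a split token image
can be written out as a short sketch. The structure and
function names are hypothetical and nothing here models
the BitBlt instruction itself; the code only computes the
two source spans for a token image of a given width that
is broken a given number of bits in.

```c
#include <assert.h>

typedef struct { int offset, width; } BitSpan;   /* both in bits */

/* Source span for the part of the token image before the break. */
BitSpan span_before(int tokenWidth, int breakAt) {
    assert(0 <= breakAt && breakAt <= tokenWidth);
    BitSpan s = { 0, breakAt };
    return s;
}

/* Source span for the part of the token image after the break. */
BitSpan span_after(int tokenWidth, int breakAt) {
    assert(0 <= breakAt && breakAt <= tokenWidth);
    BitSpan s = { breakAt, tokenWidth - breakAt };
    return s;
}
```

For a 32-bit-wide image broken 17 bits in, the first span
is offset 0 with width 17 and the second is offset 17 with
width 15.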

Painting text a token at a time can now be compared to
the same process done a character at a time. In both
cases the same number of bits of font (or text image)
information is passed to the display, but the performance
is significantly faster in the case of a token at a time.
The difference in performance is largely a result of the
smaller number of times that each word of memory storing
the (system) display bitmap needs to be accessed. The
number of memory references is much larger when the text
is displayed a character at a time. The width of a
typical character bitmap is on the order of four to six
bits for font sizes normally used for document body text
(i.e. text that is not used for special functions such as
headings and footnotes). This implies that, for machines
with sixteen-bit word lengths, each word of display
memory is accessed three to four times while the bitmaps
of contiguous characters (as in a word of text) are
written into the display bitmap. If a machine has a
memory word length of thirty-two bits then each word of
memory will be accessed six to eight times while an image
is assembled.
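The access counts quoted above follow from a ceiling
division of the word length by the character width. The
sketch below is a rough model only: it assumes the
characters are painted side by side with no gaps and
ignores alignment effects, and the function name is
hypothetical.

```c
#include <assert.h>

/* Approximate number of times one display-memory word is written when
   glyphs glyphBits wide are painted side by side into it. */
int writes_per_word(int wordBits, int glyphBits) {
    assert(wordBits > 0 && glyphBits > 0);
    return (wordBits + glyphBits - 1) / glyphBits;   /* ceiling division */
}
```

Six-bit and four-bit characters give 3 and 4 writes for a
16-bit word, and 6 and 8 writes for a 32-bit word, which
matches the ranges in the text. Painting a whole token
image at once touches each word far fewer times.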

The performance difference between painting text a
character at a time and a token at a time is even more
significant if the text includes characters that require
the overstriking of several characters or portions of
characters. Examples of where this could occur are
accented characters and mathematical symbols that are not
normally included in fonts. Once the images of these
special characters or symbols are generated and inserted
into the token properties, they can be reused as long as
the font does not change.



Modifying Documents 

This invention will now be described as applied to a
WYSIWYG document editor used for publishing applications.
It could also be used for other text-processing
applications. The documents processed by this system
comprise a sequence of elements consisting of formatting
commands (or markups) interspersed with fragments of
text. The fragments of text are in turn sequences of
tokens. Each element in the document is called a piece,
and the entire data structure is called a piece table. A
piece that represents a Token.Entry vector (and thus has
content that is text) is referred to as a text piece. To
facilitate the modification of the document, the
individual pieces are linked in both directions with
pointers.



The Selection 



The user modifies a document by identifying a portion of
the text and then invoking one of the editing operations
normally implemented in a WYSIWYG editor or
text-processing system. The portion of the text that is
operated on is called the selection. The selection may
consist of as little as a single character, a series of
contiguous characters or words, or even the entire
document. The selection is usually indicated by changing
the appearance of the text. On systems with a high
resolution screen the preferred method is displaying the
selection in reverse video.

In addition to having a beginning and an end, the
selection also has an insertion point. It is usually at
the end of the selection, but can also be at the
beginning of the selection. The insertion point is
typically identified with a 'blinking' caret. The piece
table entry that contains the insertion point is defined
as the input focus. Even though visually the input focus
may appear to be attached only to text, it is entirely
possible that the insertion point may actually be
attached to a piece table entry containing commands.

The operations a user can perform on a selection are to
delete it, backspace over one or more characters or words
beginning at the insertion point, or add characters after
the insertion point. Backing up in units of words is
referred to as a backword operation. In each case the
algorithm that operates on text represented with this
invention is somewhat different from the equivalent
algorithm for processing a document encoded as a series
of characters.

Once one of the operations mentioned above is performed
on a selection, the selection is deactivated. This is
indicated to the user by de-emphasizing the text in the
selection, if any remains, after the operation.



Resolving Mouse Clicks 



Portions of the text are selected by converting the
coordinate information from a pointing device into a
position in the document. This process is called
resolving the position in the document. Virtually all
interactive WYSIWYG systems have some form of pointing
device. Most commonly, the pointing device is a mouse.
Much less frequently it is a set of cursor keys.

The pointing device provides coordinate information at
regular intervals while the device is active. The process
of converting the coordinate information into a position
in the document is facilitated by this invention.

The resolving algorithm proceeds in three phases. First
the line of text containing the coordinate is located;
next the specific token is identified; and finally the
character is located. The editor must provide some way
for the resolving software to determine which line of
text contains the new coordinate. Typically this is done
with a table of line boxes. The information describing
the line box is the location (or orientation) of the
line, dimension of the line, and beginning offset of the
first token on the line.

Once the line is located, the resolving algo- 
rithm can enumerate the tokens in the line to 
determine in which token the coordinate is located. 
At this point the individual characters can be in- 
spected to determine in which character the co- 
ordinate falls. 
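The second phase, finding the token that contains a
coordinate, might look like the following sketch. It
assumes left-aligned text, a precomputed array of token
widths for the line (each including any trailing space)
and a horizontal pixel coordinate; the function name and
parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the token on the line that contains x, or
   count if x lies beyond the last token. A coordinate to the left
   of the line resolves to the first token. */
size_t resolve_token(const int *widths, size_t count, int lineLeft, int x) {
    assert(widths != NULL || count == 0);
    int left = lineLeft;                 /* left edge of current token */
    for (size_t i = 0; i < count; i++) {
        if (x < left + widths[i])
            return i;
        left += widths[i];
    }
    return count;
}
```

Only the characters of the one token returned need to be
inspected in the third phase, so the number of comparisons
grows with the number of tokens on the line rather than
the number of characters.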

The table of line boxes is essential if the editor design
supports multiple line-setting modes. If the text is set
with variable-width spaces, or text that is not aligned
along the left side of the column, then the information
related to the number and width of the spaces must also
be included in the table. On fast processors it is
possible to replace the line table with a procedural
interface to the pagination logic that returns the
information in the nth line box by paginating the
appropriate page through the nth line.

By proceeding across the line a token at a time rather
than a character at a time, as is classically the case,
the resolving process proceeds much faster. In
particular, with complex text, such as international or
mathematical text that cannot be represented by a single
byte per character, the method described here will be
significantly faster.



The Delete Algorithm 


The algorithm for deleting a portion of a document
encoded as a sequence of tokens is very similar to the
algorithm for deleting text stored as a sequence of
characters stored in a piece table. In the general case,
a deletion is performed on a selection consisting of more
than one character where the first and last character in
the selection are not in the same piece table entry. The
simplest case is when the beginning of the selection is
just after a space (the beginning of a token) and
similarly, when the end of the selection is just before a
space. In this case no new tokens are created as a result
of the deletion, but as many as two new pieces are
created, depending on whether the beginning and end of
the selection are in the same piece of text. The encoding
of the space immediately after the selection will have to
be changed to the space escape variant of the Token.Entry
if the space is encoded using the spaceFollows Boolean
term in the last token of the selection. If one end of
the selection falls on one of the interior characters of
the first or last token, then new tokens will result from
the deletion. If this is the case, then a changeBase
escape Token.Entry may need to be added to the resultant
pieces to reflect the base address of each new token.

The text in Figure 5 contains the same text that was
previously shown in Table 4. The highlighted portion of
the text represents a selection that will result in
several new tokens being created when the selection is
deleted. For the sake of illuminating the most general
case, it is assumed that the new tokens will be located
in a different area of memory from the area used for the
existing tokens (base address number 1). The deletion
will result in two pieces, one for the portion of the
text prior to the selection, and a separate piece for the
portion after the selection. Table 2 contains the piece
table entry for the remaining portion of the paragraph
prior to the deletion. Table 9 shows the beginning and
end of the initial fragment of the text. (The missing
portion of the piece corresponds to the portion of the
piece in Table 2.) Notice that a changeBase escape
Token.Entry is required because the new token is not in
the same token props area that the other tokens are in.
Table 10 shows the tokens in the second piece table
entry. Again, the new token is preceded by a changeBase
escape Token.Entry. A second base address change is
needed in this piece to set the base address back to the
first address space used for the remaining tokens.

The Modifiable Token.Entry Array and Token Properties


Piece table entries (and the contents of the entry) are
typically of fixed length to minimize storage
requirements. Insertion, backspace and backword all
result in changes to the content of the piece table entry
containing the insertion point. It is impractical after
each operation to allocate new space for the modified
contents, copy the content of the input focus to the new
space, update the contents of the new piece to reflect
the change, and deallocate the old contents. Performing
these operations repeatedly would result in fragmented
dynamic memory and inadequate performance. A way to avoid
this fragmentation is to identify transient resources
that can easily be modified. Specifically, a preallocated
Token.Entry array that is arbitrarily long could be
substituted for the actual contents of a piece when one
of these operations is selected. The length of the array
simply has to be adjusted at the end of each operation to
reflect the changes.

When the first operation requiring the modifiable entry
array is invoked, the content of the input focus is
copied into the modifiable entry array and a reference to
the modifiable content is placed in the input focus. The
editing algorithm is then invoked. The contents of the
input focus remain in the modifiable space area until an
event transpires signalling that no further operations on
the modifiable piece can occur. Events of this nature
include the user ending the edit session on the document,
or moving the selection to another portion of the
document. After the last edit is performed, any content
remaining in the input focus is made permanent by
allocating a new space, copying the content of the
modifiable space into the new space, and placing a
reference to the new space in the input focus.
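The life cycle of the modifiable entry array can be
sketched as two operations. This is an illustration under
assumptions: Entry is a stand-in for Token.Entry, the
fixed capacity and the names begin_edit and end_edit are
hypothetical, and ownership of the piece's original
storage is left to the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { MODIFIABLE_CAPACITY = 1000 };   /* illustrative capacity */
typedef unsigned short Entry;          /* stand-in for Token.Entry */

Entry  modifiable[MODIFIABLE_CAPACITY];   /* preallocated scratch array */
size_t modifiableLength;                  /* entries currently in use   */

typedef struct { Entry *content; size_t length; } Piece;

/* First edit: copy the piece into the scratch array and make the
   input focus refer to it. The old storage is not freed here. */
void begin_edit(Piece *focus) {
    assert(focus != NULL && focus->length <= MODIFIABLE_CAPACITY);
    if (focus->length > 0)
        memcpy(modifiable, focus->content, focus->length * sizeof(Entry));
    modifiableLength = focus->length;
    focus->content = modifiable;
}

/* Editing is over: move what remains into newly allocated storage.
   Returns 0 on success and -1 if the allocation fails. */
int end_edit(Piece *focus) {
    assert(focus != NULL);
    Entry *fresh = NULL;
    if (modifiableLength > 0) {
        fresh = malloc(modifiableLength * sizeof(Entry));
        if (fresh == NULL) return -1;
        memcpy(fresh, modifiable, modifiableLength * sizeof(Entry));
    }
    focus->content = fresh;
    focus->length = modifiableLength;
    return 0;
}
```

Between the two calls, each editing operation only changes
modifiableLength and the entries near the end of the
array, so no allocation takes place per keystroke.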



Insert and backspace also require a modifiable token
property area to avoid similar performance problems
resulting from repeatedly allocating and deallocating
token properties during rapid operations. A reference to
the modifiable token property space is substituted for
the last token variant in the modifiable Token.Entry
array when the first insert or backspace is invoked. The
contents of the modifiable token properties are then
either discarded or transferred to a permanent property
space depending on the sequence of operations. This will
be described in more detail below.

The modifiable Token.Entry array can be of 
limited length. In general, a length of one thousand 
tokens will seldom be exhausted, and this length 
imposes few constraints on the implementation. 
The modifiable token property space need be long enough
to hold only the maximum number of break points the
implementation allows.

By convention, the modifiable token property space is
defined to be offset one in base address zero. This
convention will be used in the editing algorithms
described below.



Backspace 



When a backspace operation is invoked, the algorithm
checks at which end of the input focus the insertion
point is located. If the insertion point is on the
right-hand side of the input focus, then the algorithm
proceeds with the next step. When the insertion point is
on the left-hand side of the input focus, then the input
focus is moved to the next prior piece table entry that
contains text. The insertion point is moved to the
right-hand side of the input focus. If there are no prior
text pieces the algorithm terminates.

Next, a check is made as to whether the content of the
input focus has been copied into the modifiable entry
array. If the content of the input focus has been copied,
then the portion of the algorithm that deletes the last
character is invoked. Otherwise, the content of the input
focus is copied into the modifiable Token.Entry array.
The first step in this process is to replace the address
of the text in the input focus with the address of the
modifiable entry array. Now the algorithm can proceed to
delete a character.

Once the content of the input focus has been copied to
the modifiable Token.Entry array, the backspace algorithm
begins by locating the last content-bearing entry in the
array. A content-bearing entry is a Token.Entry that is
not a changeBase variant. The algorithm begins with the
last element of the array and proceeds backwards
inspecting each successive element. This process
terminates when the first content-bearing entry is
located. If the entry is a space escape Token.Entry, then
the token array is shortened by one to delete the space.
The algorithm then terminates since a character has been
deleted. Similarly, if the entry located by the scan for
a content-bearing entry is a token Token.Entry with the
spaceFollows Boolean term set to TRUE, then deleting the
character consists of setting the Boolean term to FALSE.
The content of the input focus stays the same, since the
final content-bearing entry was not eliminated but only
changed.
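The two simple backspace cases can be sketched as
follows. The entry representation, the result codes and
the function name are hypothetical, and the zeroWidthSpace
variant is left out. One further assumption: when the
space is removed, any changeBase entries that trailed it
are dropped as well, since they no longer precede any
token.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { ENTRY_TOKEN, ENTRY_SPACE, ENTRY_CHANGE_BASE } EntryKind;

typedef struct {
    EntryKind kind;
    bool      spaceFollows;   /* meaningful only for ENTRY_TOKEN */
} Entry;

typedef enum { BS_NOTHING, BS_DELETED_SPACE, BS_NEEDS_TOKEN_EDIT } BackspaceResult;

/* Handle a backspace when the last content-bearing entry is a space
   escape or a token with spaceFollows set. Any other case is
   reported to the caller, which must edit the token properties. */
BackspaceResult backspace_space(Entry *entries, size_t *length) {
    assert(entries != NULL && length != NULL);
    size_t i = *length;
    while (i > 0 && entries[i - 1].kind == ENTRY_CHANGE_BASE)
        i--;                              /* skip changeBase entries */
    if (i == 0)
        return BS_NOTHING;                /* no content-bearing entry */
    Entry *last = &entries[i - 1];
    if (last->kind == ENTRY_SPACE) {
        *length = i - 1;                  /* shorten array past the space */
        return BS_DELETED_SPACE;
    }
    if (last->spaceFollows) {
        last->spaceFollows = false;       /* entry changed, not removed */
        return BS_DELETED_SPACE;
    }
    return BS_NEEDS_TOKEN_EDIT;
}
```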

The final case is when the final content-bearing entry is
a token Token.Entry that does not have the spaceFollows
Boolean term set to TRUE. If the properties for the final
token are not already in the modifiable token property
area, then the contents are copied and the entry array is
updated to reflect this. In general, the content-bearing
entry is replaced with two entries. The first is a
changeBase escape Token.Entry to set the base address to
zero (the modifiable token property area), and the second
is a token Token.Entry with an offset of one. Finally,
the length of the content of the input focus is updated
(usually by incrementing the length by one).

The final portion of the backspace algorithm is the
procedure to delete a character when the final element of
the modifiable entry array is a token Token.Entry which
has already been copied to the modifiable token property
area (from a previous operation) and the spaceFollows
Boolean term is already FALSE. The first step is to
determine how many bytes need to be deleted to eliminate
the last logical character. International characters,
such as accented characters, may be represented by byte
sequences longer than one byte. If the resulting deletion
does not exhaust the bytes in the token properties, then
the properties are updated and the algorithm terminates.
If deletion exhausts the bytes in the distinguished token
property area, then the length of the distinguished token
array is reduced by one and the procedure to locate the
first preceding content-bearing entry array element is
executed.



Backword 



The first two steps of the backword algorithm are
identical to the backspace algorithm. If the insertion
point is on the left-hand side of the input focus, then
the content of the next prior piece containing text is
set as the input focus and the insertion point is moved
to the right-hand side of the new input focus. Next, if
the content of the input focus is not in the modifiable
entry array, then the content is copied into the
modifiable array, and a reference to the modifiable entry
array is placed in the input focus.

Once the content of the input focus is copied to the
modifiable entry array, the backword algorithm locates
the last content-bearing entry in the array. This entry
must not be a space. Next, a second scan is initiated to
locate the next prior space or punctuation mark. The scan
logic must include the possibility that a word may be
fragmented into two or more contiguous tokens or that the
token being deleted is the first token in the document.

Once the right-hand boundary of the backword is
identified, all of the subsequent tokens are deleted by
adjusting the length of the modifiable token array.
Notice that, since the backword algorithm always results
in the elimination of an integral number of tokens, no
new tokens are created.
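
A minimal Python sketch of the two backward scans follows. It uses a deliberately simplified model that is an assumption of this example, not the encoding of the specification: each entry is a small record holding its text, a spaceFollows flag, and a marker for space escapes. The names Entry, PUNCT and backword are hypothetical.

```python
from dataclasses import dataclass

PUNCT = set(".,;:!?")


@dataclass
class Entry:
    text: str = ""                  # empty for a space escape
    space_follows: bool = False
    is_space_escape: bool = False


def is_delimiter(entry: Entry) -> bool:
    return entry.is_space_escape or entry.text in PUNCT


def backword(entries: list) -> None:
    """Delete the last word by shortening the entry list in place."""
    end = len(entries)
    # First scan: find the last content-bearing entry that is not a space.
    while end > 0 and entries[end - 1].is_space_escape:
        end -= 1
    start = end
    if start > 0:
        start -= 1
        # Second scan: step back over contiguous fragments of the same word
        # until a space, a punctuation mark, or the document start is found.
        while (start > 0
               and not is_delimiter(entries[start - 1])
               and not entries[start - 1].space_follows):
            start -= 1
    # Only whole entries are removed, so no new tokens are created.
    del entries[start:]
```

With the entries for "The quick bro|wn" (the last word split into two fragments), one call removes both fragments, and a later call that reaches the first token simply empties the list.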



Insert 


The insert algorithm also requires the modifiable entry
array, so the insert algorithm also begins by
determining what content, if any, from the input focus
must be moved into the special entry array. If the
insertion point is on the left-hand side of the input
focus, then a new piece table entry must be created and
linked into the table prior to the input focus. The new
piece is now made the input focus and the insertion
point is moved to the right-hand side of this piece.

If the insertion point is on the right-hand side of the
input focus and the content of the input focus is not in
the modifiable entry array, then the content of the
input focus is copied there and a reference to the
modifiable entry array is placed in the input focus.

Now a character can be added to the document. The logic
that is executed depends on the class of character. The
basic classes are space, letters, and punctuation marks.
Punctuation marks are characters that are never included
in a token. These are also identified to the
line-breaking algorithm to assist it in determining
where allowable break points are. In applications where
significant amounts of numeric data are expected - such
as financial text processing applications - the
implementation may need to treat numeric data in a
special way. Since numbers tend to be unique, or nearly
unique, the statistical properties that make this
invention highly efficient for text will not be
realized. The result will, very likely, be very large
documents with prohibitive runtime memory requirements.
One alternative is to treat tabular data as normal
strings of text, or to create a special escape variant
that would contain the numeric data.

When the first letter is encountered, the final token in
the input focus is inspected. If it is not a reference
to the modifiable token properties, the reference is
added to the end of the piece. This usually involves
lengthening the contents of the input focus by two. A
changeBase escape variant referring to the space
containing the modifiable token properties is added,
followed by a Token.Entry referring to the modifiable
properties. The metrics, break-point data, display
bitmap, and character array should all be initialized to
null values. If the final entry in the array is a token
variant with the spaceFollows Boolean term set to FALSE,
then the entry is replaced with the two entries required
for the reference to the modifiable token properties. In
this case the contents of the modifiable properties are
initialized to be the contents of the token that was
replaced.

Now the letter is added to the character array and the
font is set to a value that forces a recomputation of
the metrics and bitmap. In general, it is recommended
that hyphenation of the new token be delayed until a
break character (i.e. a space or punctuation mark) is
encountered, unless the hardware is very fast.

Processing the second and subsequent (contiguous)
letters consists of only the final step of those
required for the processing of the first letter: the
letter is added to the character array and the font
field in the token properties is set to the special
value that causes the invalidProps return from the
line-breaking algorithm.

When a character that is not a letter is input after one
or more letters, or the input operation is terminated,
the contents of the modifiable token properties are
converted into a permanent token. Since the new token
may not have the same base address as the last token
before the modifiable token properties, a changeBase
escape may be needed.

If the event that terminates the input to the modifiable
token properties is a space, then the space should be
encoded with the spaceFollows Boolean term in the new
token. If a space is input, but no letters are in the
modifiable token, then the space escape variant must be
used. Similarly, if a punctuation mark is input after
one or more letters have been input, then the modifiable
token properties must be converted into permanent
properties prior to processing the punctuation mark.
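
The character-class logic described above can be sketched as a small state machine in Python. This is an illustrative model only: the class name Inserter, the tuple layout of the entries and the punctuation set are assumptions of the example, and the changeBase escapes, metrics and bitmaps of the real encoding are omitted.

```python
PUNCTUATION = set(".,;:!?")


class Inserter:
    def __init__(self):
        # Committed entries: ("token", text, space_follows) or ("space",).
        self.entries = []
        # Character array of the modifiable token still being typed.
        self.pending = ""

    def _commit(self, space_follows: bool) -> None:
        """Convert the modifiable token into a permanent token."""
        self.entries.append(("token", self.pending, space_follows))
        self.pending = ""

    def type_char(self, ch: str) -> None:
        if ch == " ":
            if self.pending:
                # A space after letters is folded into spaceFollows.
                self._commit(True)
            else:
                # A space with no pending letters needs a space escape.
                self.entries.append(("space",))
        elif ch in PUNCTUATION:
            if self.pending:
                # Letters are made permanent before the punctuation mark.
                self._commit(False)
            self.entries.append(("token", ch, False))
        else:
            # Letters only extend the modifiable character array.
            self.pending += ch

    def finish(self) -> None:
        """Terminate input, committing any pending letters."""
        if self.pending:
            self._commit(False)
```

Typing "Hi,  you" (with two spaces) into this model yields a token for "Hi", a punctuation token, two space escapes, and a token for "you" once input ends.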



The Search Algorithm 



In general, a user searches for a string of text with
the expectation that the algorithm will determine
whether two strings are equivalent. Equivalence is
defined to mean that the encoding for one string can be
transformed into the encoding for the other string with
no loss of information. Similarly, when searching for
combinations of text and commands, the expectation is
that the effect of the commands will be recognised
independent of the sequence of commands. For example, if
the command for initiating bold text is
<BD>
and the command for italic text is
<IT>
then clearly, the effect on the text of the command
sequence
<BD><IT>
is the same as that of
<IT><BD>
Therefore, the two should be interpreted by the search
as equivalent. Conceptually, this is equivalent to
accumulating the commands in a buffer and inspecting
them before the beginning of each text piece.
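
The buffering idea can be sketched in Python as follows. The stream format (a flat list of strings in which commands are written in angle brackets) and the function name effective_commands are assumptions of this example. The sketch also assumes that a command stays in effect once issued, since no terminating commands are described here.

```python
def effective_commands(stream):
    """Yield (commands_in_effect, text) for each text piece in the stream.

    Commands are collected in an unordered set, so the order in which they
    were issued does not affect the comparison.
    """
    active = set()
    for item in stream:
        if item.startswith("<") and item.endswith(">"):
            active.add(item)
        else:
            yield frozenset(active), item
```

Under this model the streams `<BD><IT>word` and `<IT><BD>word` produce identical results, while `<BD>word` does not match either.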

The equivalence of tokens is intuitive: a word that is
split between multiple contiguous tokens must match the
same content represented as a single token. The
following algorithm will be presented in terms of
equivalent tokens. Similarly, the equivalent encoding of
a sequence of commands is also natural. Thus, the
algorithm for searching when formatting commands are
significant (that is, when the looks of the search
string are included in the search criteria) will also be
presented in terms of equivalent commands that are
independent of the sequence of commands.

The search consists of processing three special cases:
the equivalent token prior to the first delimiter, the
equivalent token after the last delimiter, and the rest
of the tokens. In this context a delimiter is defined to
be a space or punctuation mark. The scan proceeds until
a token is identified that is exactly equivalent to, or
ends in text that is exactly equivalent to, the first
token. The algorithm now processes each successive
delimiter and interior token. If the match is exact the
scan proceeds. Finally, the last token is compared. A
match is identified if the final equivalent token in the
search string is a perfect match with the corresponding
token in the passage of text being searched. Similarly,
the search is successful if the final equivalent token
matches the beginning of the corresponding token in the
text.
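
A minimal Python sketch of the three cases is given below. It models the document and the search string as lists of strings in which each delimiter is its own element and a word may be split into several contiguous fragments. The names DELIMS, normalize and find are hypothetical, and the handling of a single-token search string (substring match) is an assumption not covered by the text above.

```python
DELIMS = set(" .,;:!?")


def normalize(tokens):
    """Merge contiguous word fragments into equivalent tokens."""
    out = []
    for t in tokens:
        if out and t not in DELIMS and out[-1] not in DELIMS:
            out[-1] += t
        else:
            out.append(t)
    return out


def find(doc, pattern):
    """Return the index (in equivalent tokens) of the first match, or -1."""
    doc, pat = normalize(doc), normalize(pattern)
    n = len(pat)
    if n == 0:
        return -1
    for i in range(len(doc) - n + 1):
        window = doc[i:i + n]
        if n == 1:
            if pat[0] in window[0]:
                return i
            continue
        if (window[0].endswith(pat[0])            # first token: suffix match
                and window[1:-1] == pat[1:-1]      # interior: exact match
                and window[-1].startswith(pat[-1])):  # last: prefix match
            return i
    return -1
```

For instance, searching for "coded te" in a document whose word "encoded" is stored as the fragments "en" and "coded" succeeds, because the fragments are first merged into one equivalent token.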

A slight generalization is required to define the search
algorithm when formatting commands in the target string
are significant. Before the scan, the source string is
converted into tokens and the effective commands are
determined. At the beginning of each text piece, the
state of the effective command is compared with the
corresponding commands in the source string. If the
effective command does not match, the comparison fails.



Search and Replace 



The search and replace function is defined in terms of
previously-defined edit operations. First, the source
and replace strings are converted into a sequence of
tokens. Then, the scan of the document is performed.
When a match is made, the matched portion of the
document is converted into an implied selection. The
selection is then deleted using the previously-defined
operation. The encoded replace string is then inserted
at the insertion point. The scan then proceeds to locate
the next match.
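
The sequence of tokenize, scan, delete and insert can be sketched in Python as follows. The tokenizer (a regular expression that separates words from single non-word characters) and the exact-match comparison are simplifying assumptions of this example, and the function names are hypothetical.

```python
import re


def tokenize(text):
    """Split text into word tokens and single-character delimiters."""
    return re.findall(r"\w+|\W", text)


def search_and_replace(doc, source, replacement):
    """Replace every match in the token list `doc` in place.

    Returns the number of replacements made.
    """
    src, rep = tokenize(source), tokenize(replacement)
    i = 0
    count = 0
    while src and i + len(src) <= len(doc):
        if doc[i:i + len(src)] == src:
            del doc[i:i + len(src)]   # delete the implied selection
            doc[i:i] = rep            # insert at the insertion point
            i += len(rep)             # resume the scan after the insert
            count += 1
        else:
            i += 1
    return count
```

Resuming the scan after the inserted tokens keeps a replacement string that contains the source string from being matched again.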



Checking Spelling 



The algorithm to check the spelling of the text in a
document encoded with this method is very efficient.
There are two cases to consider, depending on whether
the entire document is checked, or whether only a
selected portion of the document is checked.

Prior to processing the document, two temporary data
structures are created. The first is a list with an
element for each unique token. Each element contains a
reference count, and a Boolean field. The elements are
initialized with the respective reference count and the
value FALSE. The second is a list to retain each
misspelled word that is composed of more than one token.
This list is initially empty.

Processing the document begins with checking each entry
in the document token dictionary against the spelling
lexicon. Processing next proceeds to setting the
appropriate Boolean field in the temporary list to TRUE
for tokens that are identified as misspelled by the
spelling lexicon. The document is now scanned for words
that are composed of several contiguous tokens. For each
instance of a composite token, the temporary reference
count for each token in the composite is decremented,
then the word is checked against the spelling lexicon.
If the word is incorrectly spelled, the word is added to
the misspelled list if it is not already in the list. At
the end of the document, words in the list of misspelled
composite words are reported to the user. Similarly,
token dictionary entries that are marked as misspelled
and that have a non-zero temporary reference count are
also reported to the user.
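
The whole-document case can be sketched in Python as follows. The data model is an assumption of this example: the token dictionary maps a token id to its text, the document is a list of words, each given as a list of token ids (more than one id for a composite word), and the lexicon is a plain set of correctly spelled words. The function name check_document is hypothetical.

```python
from collections import Counter


def check_document(token_dict, words, lexicon):
    """Return the misspelled words of a whole document."""
    # Temporary structure 1: reference count and Boolean flag per token.
    refs = Counter(tid for word in words for tid in word)
    misspelled = {tid: text not in lexicon for tid, text in token_dict.items()}

    # Temporary structure 2: misspelled words made of several tokens.
    composite_bad = []
    for word in words:
        if len(word) > 1:
            # These fragments are not stand-alone words in this instance.
            for tid in word:
                refs[tid] -= 1
            text = "".join(token_dict[tid] for tid in word)
            if text not in lexicon and text not in composite_bad:
                composite_bad.append(text)

    # A flagged token is reported only if it still occurs on its own.
    single_bad = [token_dict[tid] for tid in token_dict
                  if misspelled[tid] and refs[tid] > 0]
    return composite_bad + single_bad
```

The benefit is that each unique token is looked up in the lexicon once, however often it occurs, and fragments such as "en" and "coded" are not reported when they only ever appear inside the correctly spelled word "encoded".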



In the second case of checking a portion of the text,
only a list to contain the words that are misspelled is
required. The processing consists of enumerating each
word in the selected text, whether or not it is composed
of several fragments, then checking the word against the
spelling lexicon. Words that are misspelled are added to
the misspelled list. After the last token is identified,
the misspelled list is reported to the user.


External Document Representation 



The text representation described here, coupled with the
piece table, can be used as an extremely compact
machine-readable external form. Minimizing the external
storage requirements consists of saving the piece table,
the memory containing the Token.Entry arrays that the
piece table refers to, and a subset of the token
properties. Only the portion of the token properties
that cannot be recomputed quickly needs to be written
into the file. Thus the characters, an array noting the
type and position of each break point, and the base
address and offset of the properties should suffice. By
not storing the break-point array, metric information,
and token bitmap image, the property space can be
reduced by at least three-quarters.

Typically, for large documents the compaction will be a
factor that converges on twice the inverse of the
average word length in the document. For standard
English the ratio is slightly less than one third. For
technical documentation that contains large numbers of
longer words, the ratio will improve.
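
The arithmetic behind this estimate can be illustrated with a short Python sketch. It assumes that each word occurrence costs one two-byte entry, that each character of plain text costs one byte, and that the shared token properties become negligible for a large document. These are assumptions of the example; the two-byte entry size in particular is inferred rather than stated here.

```python
def compaction_ratio(average_word_length: float,
                     bytes_per_entry: int = 2) -> float:
    """Encoded size divided by plain-text size for a large document."""
    return bytes_per_entry / average_word_length


# An average word length of six characters gives one third;
# longer words give a smaller (better) ratio.
print(compaction_ratio(6.0))
print(compaction_ratio(8.0))
```

Under these assumptions, a ratio slightly below one third corresponds to an average word length slightly above six characters.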



Claims 


1. A method of encoding text comprising a string of
atoms comprising words, numerals, punctuation marks and
spaces, into a list of tokens and a corresponding array
of data blocks, comprising the steps of:

determining whether a current atom in a string is
identical to a previous atom that has been encountered
earlier in a string;

if so, adding the token of a previous atom to a list of
tokens, and proceeding to the next atom;

if not, adding a new and unique token to the list of
tokens, creating a new data block corresponding to the
new token, comprising the string of characters in the
current atom, and proceeding to the next atom;

adding the width of the atom to the data block so that
the data block can be used to determine where the
current line can be broken;

for each data block that contains a word, adding the
hyphenation points of the word, and the widths of each
partial word thus produced by the hyphenation points, to
the data block;

adding an atom bit map to the data block during the
creating step, and

for each current line, storing the atom bit maps into a
line buffer to create a line bit map for raster imaging.

2. The method of breaking lines of text comprising a
string of atoms comprising words, numerals, punctuation
marks and spaces, into a list of tokens and a
corresponding array of data blocks, and then deleting an
atom therefrom, comprising the steps of:

determining whether a current atom in a string is
identical to a previous atom that has been encountered
earlier in the string;

a. if so, adding the token of the previous atom to a
list of tokens, and proceeding to the next atom;

b. if not, adding a new and unique token to the list of
tokens, creating a new data block corresponding to the
new token comprising the string of characters in the
current atom, and proceeding to the next atom;

adding the width of the atom to the data block so that
the data block can be used to determine where the
current line can be broken;

using the current token to access the corresponding data
block;

adding the width of the current atom in the data block
to a running sum of widths on the current line to
determine the total width of all atoms on the current
line, and

if the total width is sufficient to fill the line, then,

a. if the previous token were a space, breaking the line
at the end of the current atom, otherwise,

b. breaking the line at the end of the last token on the
line that was followed by a space,

if the total width is not sufficient to fill the line,
proceeding to the next token, and then deleting the
atom, comprising the steps of:

deleting the token associated with the atom, and

deleting the data block associated with the deleted
token if there are no further references to the data
block.

3. The method of encoding text comprising a string of
atoms comprising words, numerals, punctuation marks and
spaces, into a list of tokens and a corresponding array
of data blocks and using each successive token in the
token list to determine a line break point, and then
deleting an atom in a line thus formed, comprising the
steps of:

determining whether a current atom in a string is
identical to a previous atom that has been encountered
earlier in the string,

if so, adding the token of the previous atom to the list
of tokens, and proceeding to the next atom,

if not, adding a new and unique token to the list of
tokens, creating a new data block corresponding to the
new token comprising the string of characters in the
current atom, and proceeding to the next atom,

adding the width of the atom to the data block so that
the data block can be used to determine where the
current line can be broken,

for each data block that contains a word, adding the
hyphenation points of the word, and the widths of each
partial word thus produced by the hyphenation points, to
the data block;

using the current token to access the corresponding data
block;

adding the width of the current atom in the data block
to a running sum of atom widths on the current line to
determine the total width of all atoms on the current
line,

if the sum of widths is not sufficient to fill the
current line, proceeding to the next token,

if the sum of widths is sufficient to fill the line,
breaking the line at the end of the current atom and
proceeding to the next line,

if the sum of widths is too great to fit on the current
line, dividing the current word at a hyphenation point,
adding the beginning of the hyphenated word to the end
of the current line, and starting the next line with the
ending portion of the current word, and then deleting an
atom, comprising the steps of:

deleting the token associated with the atom,

deleting the data block associated with the deleting
token if there are no further references to the data
block, and

repeating the line-breaking steps,

wherein two atoms are defined as identical when they
have the same characters, style, size, weight, stress
and capitalisation.

4. The method as claimed in Claim 1, comprising the
steps of:

deleting the token associated with the atom,

deleting the data block associated with the deleted
token if there are no further references to the data
block, and

beginning with the first token on the line, breaking the
line in accordance with the method of Claim 2.

5. The method of breaking lines of text comprising a
string of atoms comprising words, numerals, punctuation
marks and spaces, into a list of tokens and a
corresponding array of data blocks, comprising the steps
of:

determining whether a current atom in a string is
identical to a previous atom that has been encountered
earlier in the string;

if so, adding the token of the previous atom to the list
of tokens, and proceeding to the next atom;

if not, adding a new and unique token to the list of
tokens, creating a new data block corresponding to the
new token comprising the string of characters in the
current atom, and proceeding to the next atom,

adding the width of the atom to the data block so that
the block can be used to determine where the current
line can be broken,

using the current token to access the corresponding data
block,

adding the width of the current atom in the data block
to a running sum of widths on the current line to
determine the total width of all atoms on the current
line, and

if the total width is sufficient to fill the line, then,

a. if the previous token were a space, breaking the line
at the end of current atom, otherwise,

b. breaking the line at the end of the last token on the
line that was followed by a space, and

if the total width is not sufficient to fill the line,
proceeding to the next token, and then deleting a
character by:

substituting therefor a first token and a data block for
the fragment of the atom preceding the deleted
character, and a second token and data block for the
fragment of the atom after the deleted character,

deleting the data block associated with the deleted
token if there are no further references to the data
block, and

beginning with the first token on the line, repeating
the line-breaking steps.

6. The method of encoding text comprising a string of
atoms, said string of atoms comprising words, numerals,
punctuation marks and spaces, into a list of tokens and
a corresponding array of data blocks and using each
successive token in the token list to determine a line
break point, comprising the steps of:

determining whether the current atom in said string is
identical to a previous atom that has been encountered
earlier in said string,

if so, adding the token of said previous atom to said
list of tokens, and proceeding to the next atom,

if not, adding a new and unique token to said list of
tokens, creating a new data block corresponding to said
new token comprising the string of characters in said
current atom, and proceeding to the next atom,

adding the width of the atom to said data block so that
said data block to a running sum of atom widths on the
current line to determine the total width of all atoms
on the current line,

if the sum of widths is not sufficient to fill the
current line, proceeding to the next token,

if the sum of widths is sufficient to fill the line,
breaking the line at the end of the current atom and
proceeding to the next line,

if the sum of widths is too great to fit on the current
line, dividing the current word at a hyphenation point,
adding the beginning of the hyphenated word to the end
of the current line, and starting the next line with the
ending portion of the current word, and

wherein two atoms are defined as identical when they
have the same characters, style, size, weight, stress
and capitalization, and then deleting a character from
an atom in a line of text thus formed comprising the
steps of:

deleting the token associated with the atom to be
changed,

substituting therefor a first token and data block for
the fragment of the atom preceding the deleted
character, and a second token and data block for the
fragment of the atom after the deleted character,

deleting the data block associated with the deleted
token if there are no further references to the data
block, and

beginning with the first token on the line, break the
line in accordance with the line breaking steps.

7. The method of deleting a character from an atom in a
line of text formed by the method of Claim 8 comprising
the steps of:

deleting the token and data block associated with the
atom to be changed, [...] preceding the deleted
character, and a second token and data block for the
fragment of the atom after the deleted character,

deleting the data block associated with the deleting
token if there are no further references to the data
block, and

beginning with the first token on the line, breaking the
line in accordance with the method of Claim 2.

8. The method of breaking lines of text comprising a
string of atoms, said string of atoms comprising words,
numerals, punctuation marks and spaces, into a list of
tokens and a corresponding array of data blocks
comprising the steps of:

determining whether the current atom in said string is
identical to a previous atom that has been encountered
earlier in said string,

if so, adding the token of said previous atom to said
list of tokens, and proceeding to the next atom,

if not, adding a new and unique token to said list of
tokens, creating a new data block corresponding to said
new token comprising the string of characters in said
current atom, and proceeding to the next atom,

adding the width of the atom to said data block so that
said data block can be used to determine where the
current line can be broken,

using the current token to access the corresponding data
block,

adding the width of the current atom in the data block
to a running sum of widths on the current line to
determine the total width of all atoms on the current
line, and

if said total width is sufficient to fill the line,
then,

a. if the previous token was a space, breaking the line
at the end of the current atom, otherwise,

b. breaking the line at the end of the last token on the
line that was followed by a space, and

if said total width is not sufficient to fill the line,
proceeding to the next token, and then adding a
character to an atom in a line of text, comprising the
steps of:

deleting the token associated with the atom to be
changed,

substituting therefor

(1) a first token and a data block for the fragment of
the atom preceding the inserted character,

(2) a second token corresponding to

(a) a data block created from the method in Claim 1 if
the character is a space or punctuation mark, or

(b) special data block if the character is a letter or
number, and

(3) a third token and data block for the fragment of the
atom after the inserted character, and

beginning with the first token on the line, breaking the
line in accordance with line-breaking steps.

9. The method as claimed in Claim 8, including adding a
second character to an atom in a line of text,
comprising the steps of:

(1) if the character is a letter or number, adding the
new character to the special data block,

(2) if the new character is a space or punctuation mark,

(a) using the character in the special data block to
determine the token for the new atom,

(b) deleting the token corresponding to the special data
block,

(c) substituting the reference to the token determined
in (a),

(e) inserting a second token corresponding to a created
data block for the character if it is a space or
punctuation mark,

(f) processing the next character and beginning with the
first token on the line, breaking the line in accordance
with the line-breaking steps.

10. The method of encoding text comprising a string of
atoms, said string of atoms comprising words, numeral,
punctuation marks and spaces, into a list of tokens and
a corresponding array of data blocks and using each
successive token in the token list to determine a line
break point, comprising the steps of:

determining whether the current atom in said string is
identical to a previous atom that has been encountered
earlier in said string,

if so, adding the token of said previous atom to said
list of tokens, and proceeding to the next atom,

if not, adding a new and unique token to said list of
tokens, creating a new data block corresponding to said
new token comprising the string of characters in said
current atom, and proceeding to the next atom,

adding the width of the atom to said data block so that
said data block can be used to determine where the
current line can be broken,

using the current token to access the corresponding data
block,

adding the width of the current atom in the data block
to a running sum of atom widths on the current line to
determine the total width of all atoms on the current
line,

if the sum of widths is not sufficient to fill the
current line, proceeding to the next token,

if the sum of widths is sufficient to fill the line,
breaking the line at the end of the current atom and
proceeding to the next line,

if the sum of widths is too great to fit on the current
line, dividing the current word at a hyphenation point,
adding the beginning of the hyphenated word to the end
of the current word, and then adding a character to an
atom in a line of text thus formed by:

deleting the token associated with the atom to be
changed,

substituting therefor

(1) a first token and a data block for the fragment of
the atom preceding the inserted character,

(2) a second token corresponding to

(a) a data block created from the method in Claim 1 if
the character is a space or punctuation mark, or

(b) special data block if the character is a letter or
number,

(3) a third token and data block for the fragment of the
atom after the inserting character, and

beginning with the first token on the line, breaking the
line in accordance with the method of Claim 3,

wherein two atoms are defined as identical when they
have the same characters, style, size, weight, stress
and capitalization.

The approach used here is to encode the entire document
as a sequence of tokens (or numbers) where each unique
punctuation mark, space, or sequence of contiguous
characters separated by a space or punctuation mark is
identified with a unique token. During an editing
session a table is constructed with a set of properties
for each token. Token properties include the last font
the token was encountered, the type of token, as well as
metric information for the entire word and each
hyphenation point. To demonstrate the use of escape
entries each of the words in this sentence will be
separated by two spaces.



Figure 5 



EP 0 391 706 A2 



DIRECTORY
  TokenProps USING [Offset];

Token: DEFINITIONS = BEGIN

EntryType: TYPE = MACHINE DEPENDENT {token(0), escape(1)};
EscapeType: TYPE = MACHINE DEPENDENT {space(0), zeroWidthSpace(1), changeBase(2)};
ThirteenBits: TYPE = [0..8191];

Entry: TYPE = MACHINE DEPENDENT RECORD [
  SELECT entryType: EntryType FROM
    token =>
      [spaceFollows: BOOLEAN,
       offset: TokenProps.Offset],
    escape =>
      [SELECT escapeType: EscapeType FROM
         changeBase => [newBase: ThirteenBits],
         space => [],
         zeroWidthSpace => [],
         ENDCASE],
    ENDCASE];

EncodedText: TYPE = LONG DESCRIPTOR FOR ARRAY CARDINAL OF Entry;
END.
[Figure: the sample text above in encoded form, listed as a sequence of Token.Entry values. The listing begins with an escape entry [escape [changeBase [newBase: 1]]] and continues with token entries of the form [token [spaceFollows: TRUE or FALSE, offset: n]], one per word or punctuation mark, with repeated words sharing the same offset. In the final sentence, where the words are separated by two spaces, each token entry is followed by an [escape [space []]] entry. The individual entries are not reliably recoverable from the scan.]

[token [spacePollows: TRUE, 

(e8cape[space[]]], 

( token [ spacePollows : TRUE, 

[escape[space[]]I, 

[token (spacePollows: TRUE, 

(escape[space[ 111, 

[token (spacePollows :TRUB, 

[escdpe[space[] ]], 

[token [spacePollows: TRUE, 

( escape [space( 11], 

(token [spacePollows: TRUE, 

(escape[space(ll], 

[token [spacePollows: TRUE, 

(escape(3pace[] J], 

[token [spacePollows: TRUE, 

[escapet spaced]], 

(token (spacePollows: TRUE, 

[escape[ spaced J], 

[token I spacePollows :TRUB, 

(escape[spaced]], 

(token [spacePollows: TRUE, 

[ escape [ space [ 1 1 1, 

[token [spacePollows:TRUE, 

(escape(spacedll, 

(token [spacePollows: TRUE, 

[escape (spaced I ], 

(token ( spacePollows :PALSE, 

[token [spacePollows: FALSE, 



offset: 4251], 
offset:12991], 
, offset:13281], 
offset: 768]], 
off3et:1349]l, 

offset:1368J], 

offset: 141]], 

offset: 1397] Ir 

offset: 2801 1, 

offset: 14161], 

offset:144011, 

offset: 425]], 

offset: 280]], 

offset: 14111, 

offset: 1464]], 

offset:1485]l, 

Offset:1504II, 

off set 1 15361 J, 

offset:1560]], 

offset:15791], 

offset: 629]], 

offset: 661] J, 

offset:1598]lr 

offset:16l7Jl, 
offset: 768]] 



eacho 

hyphenationa 
point 

a 

To 

Smonstraten 



ICQ 
□ 

useD 
□ 

□ 

escape^ 
□ 

entriesa 

□ ^ 
eacha 

ofd 

Sea 

wordsQ 
□ 

inQ 

SisQ 
□ 

sentenceG 
□ 

wiiin 

EeQ 
□ 

separatedn 
□ 

byQ 
□ 

tWOQ 
□ 

spaces 



Table 2 continued 
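The listing above can be read back into plain text by walking the entries in order: each token contributes its characters, a set spaceFollows flag contributes one space, and each space escape contributes one more. The following C fragment is a minimal sketch of that reading, not the patent's implementation. The names `struct entry`, `decode_entries`, and the `atom` string (which stands in for the lookup of a token's characters through its offset) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified entry; the real encoding packs each entry
   into a single 16-bit word and stores an offset, not a string. */
enum entry_kind { ENTRY_TOKEN, ENTRY_ESCAPE_SPACE };

struct entry {
    enum entry_kind kind;
    int space_follows;   /* meaningful for ENTRY_TOKEN only */
    const char *atom;    /* stand-in for the property-table lookup */
};

/* Rebuild plain text from a sequence of entries.  Returns the number of
   bytes written (excluding the terminating NUL), or 0 if the buffer is
   too small. */
size_t decode_entries(const struct entry *e, size_t n, char *out, size_t cap)
{
    size_t used = 0;
    assert(cap > 0);
    for (size_t i = 0; i < n; i++) {
        if (e[i].kind == ENTRY_TOKEN) {
            size_t len = strlen(e[i].atom);
            if (used + len + 2 > cap)   /* atom + optional space + NUL */
                return 0;
            memcpy(out + used, e[i].atom, len);
            used += len;
            if (e[i].space_follows)
                out[used++] = ' ';
        } else {                        /* explicit space escape */
            if (used + 2 > cap)
                return 0;
            out[used++] = ' ';
        }
    }
    out[used] = '\0';
    return used;
}
```

Applied to the last entries of the listing (by, escape, two, escape, spaces, period), the sketch yields the words separated by two spaces, which is the purpose of the escape entries in the final sentence.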



EP 0 391 706 A2 



DIRECTORY Environment;

TokenProps: DEFINITIONS = BEGIN

BreakType: TYPE = MACHINE DEPENDENT
  {hard(0), discretionary(1), best(2), better(3), good(4), ok(5), specialGermanCase(255)};

lastPropsOffset: PRIVATE CARDINAL = 16384;

Base: TYPE = LONG BASE POINTER TO Object;

Offset: TYPE = Base RELATIVE POINTER [0..lastPropsOffset) TO Object;

Handle: TYPE = LONG POINTER TO Object;

TrailerHandle: TYPE = LONG POINTER TO ObjectTrailer;

maxPages: CARDINAL = lastPropsOffset/Environment.wordsPerPage;

Object: TYPE = MACHINE DEPENDENT RECORD [
  notPunctuation (0:0..0): BOOLEAN,
  font (0:1..15): FontATOM,
  atomMetrics (1): AtomMetrics,
  breakPointCount (4): CARDINAL,
  breakPoint (5): SEQUENCE COMPUTED CARDINAL OF BreakPointData];

ObjectTrailer: TYPE = MACHINE DEPENDENT RECORD [
  referenceCount: CARDINAL,
  [illegible]: PACKED SEQUENCE COMPUTED CARDINAL OF Environment.Byte];

AtomMetrics: TYPE = MACHINE DEPENDENT RECORD [
  micaLength (0): NATURAL,
  pixelLength (1): NATURAL,
  byteCount (2): NATURAL];

BreakPointData: TYPE = MACHINE DEPENDENT RECORD [
  type (0:0..7): BreakType,
  byteCount (0:8..15): Environment.Byte,
  micaLength (1): NATURAL,
  pixelLength (2): NATURAL];

FontATOM: TYPE = NATURAL;

END.
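Because lastPropsOffset is 16384, every offset into the property table fits in 14 bits, which leaves two bits of a 16-bit entry for the entry type and the spaceFollows flag. The following C fragment is a minimal sketch of such a packing. The bit positions chosen here (offset in the low 14 bits, spaceFollows in bit 14, entry type in bit 15) and all function names are assumptions for illustration; the listings fix only the field widths.

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_BITS       14u
#define OFFSET_MASK       ((1u << OFFSET_BITS) - 1u)  /* 0x3FFF */
#define SPACE_FOLLOWS_BIT (1u << 14)                  /* assumed position */
#define ESCAPE_BIT        (1u << 15)                  /* 0 = token, 1 = escape */

/* Pack a token entry: the property-table offset plus the flag that
   records whether a single space follows the token in the text. */
uint16_t pack_token(unsigned offset, int space_follows)
{
    assert(offset <= OFFSET_MASK);
    return (uint16_t)((space_follows ? SPACE_FOLLOWS_BIT : 0u) | offset);
}

int entry_is_token(uint16_t e)       { return (e & ESCAPE_BIT) == 0; }
int token_space_follows(uint16_t e)  { return (e & SPACE_FOLLOWS_BIT) != 0; }
unsigned token_offset(uint16_t e)    { return e & OFFSET_MASK; }
```

With this layout a word and the space after it occupy two bytes regardless of the word's length, which is the source of the compactness claimed for the representation.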
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The approach used here is to encode the entire docu- 
ment as a sequence of tokens (or numbers) where each 
unique punctuation mark, space, or sequence of contiguous 
characters separated by a space or punctuation mark is 
identified with a unique token. During an editing session a 
table is constructed with a set of properties for each token.
Token properties include the last font the token was 
encountered, the type of token, as well as metric 
information for the entire word and each hyphenation 
point. To demonstrate the use of escape entries each of
the words in this sentence will be separated by two 
spaces. 
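The encoding described in the paragraph above can be sketched as a single pass over the characters: a run of contiguous non-space, non-punctuation characters becomes one atom, each punctuation mark becomes its own atom, one following space is folded into the atom's spaceFollows flag, and any further space becomes a separate space escape. The C fragment below is an illustration under simplifying assumptions (ASCII text, the space character as the only white space, no assignment of unique token numbers); `struct atom` and `tokenize` are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum atom_kind { ATOM_WORD, ATOM_PUNCT, ATOM_SPACE_ESCAPE };

struct atom {
    enum atom_kind kind;
    size_t start, len;     /* slice of the input text */
    int space_follows;     /* one following space folded into the atom */
};

/* Split text into atoms; returns the number of atoms written to out. */
size_t tokenize(const char *text, struct atom *out, size_t cap)
{
    size_t n = 0, i = 0;
    while (text[i] != '\0') {
        unsigned char c = (unsigned char)text[i];
        struct atom a = { ATOM_WORD, i, 0, 0 };
        if (c == ' ') {
            /* a space not absorbed by the preceding atom */
            a.kind = ATOM_SPACE_ESCAPE;
            a.len = 1;
            i++;
        } else {
            if (ispunct(c)) {
                a.kind = ATOM_PUNCT;
                i++;
            } else {
                while (text[i] != '\0' && text[i] != ' ' &&
                       !ispunct((unsigned char)text[i]))
                    i++;
            }
            a.len = i - a.start;
            if (text[i] == ' ') {   /* fold one space into the atom */
                a.space_follows = 1;
                i++;
            }
        }
        assert(n < cap);
        out[n++] = a;
    }
    return n;
}
```

A complete encoder would then look each atom up in the property table and emit its offset as the token; that step is omitted here.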

Table 4 
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DIRECTORY 
  Environment USING [Byte],
  ESCAlphaExtras USING [aLINEBREAK],
  Inline USING [DBITAND],
  Mopcodes USING [zESC],
  Token USING [EncodedText],
  TokenProps USING [Base, FontATOM];

LineBreak: DEFINITIONS
  IMPORTS Inline = BEGIN

SuffixChar: TYPE = MACHINE DEPENDENT {space(0), zeroWidthSpace(1), hyphen(2),
  hardHyphen(3), null(4)};

Reason: TYPE = MACHINE DEPENDENT
  {margin(0), normal(1), changeBase(2), invalidProps(3),
   contiguousWords(4), unableToBreak(5), specialGermanCase(6)};

TwelveBits: TYPE = [0..4095];

ArgRec: TYPE = MACHINE DEPENDENT RECORD [
  text (0): Token.EncodedText,
  propsBase (3): TokenProps.Base,
  hyphenate (5:0..0): BOOLEAN,
  font (5:1..15): TokenProps.FontATOM,
  margin (6): NATURAL,
  hyphenPixelLength (7:0..7): Environment.Byte,
  minSpacePixelLength (7:8..15): Environment.Byte,
  hyphenMicaLength (8): NATURAL,
  minSpaceMicaLength (9): NATURAL,
  whiteSpace (10): NATURAL,
  final (11): State,
  prior (17): State];

ArgHandle: TYPE = LONG POINTER TO ArgRec;

ArgSpace: TYPE = ARRAY [1..SIZE[ArgRec] + argAlignment) OF UNSPECIFIED;
argAlignment: PRIVATE CARDINAL = 32;

State: TYPE = MACHINE DEPENDENT RECORD [
  index (0): CARDINAL,
  micaLength (1): INTEGER,
  pixelLength (2): INTEGER,
  count (3:0..11): TwelveBits,
  notPunctuation (3:12..12): BOOLEAN,
  suffixChar (3:13..15): SuffixChar,
  byteCount (4): INTEGER,
  whiteSpace (5): NATURAL];

AlignArgRec: PROCEDURE [p: LONG POINTER TO ArgSpace]
  RETURNS [ArgHandle] = INLINE
  {RETURN [Inline.DBITAND[p + argAlignment - 1, 0FFFFFFE0H]]};

SoftwareLineBreak: PROCEDURE [arg: ArgHandle]
  RETURNS [reason: Reason];

LineBreak: PROCEDURE [arg: ArgHandle]
  RETURNS [reason: Reason] = MACHINE CODE
  {Mopcodes.zESC, ESCAlphaExtras.aLINEBREAK};

END.

DIRECTORY
  ESCAlphaExtras,
  Frame,
  LineBreak,
  PrincOps,
  TokenProps;

LineBreakImpl: PROGRAM
  IMPORTS Frame
  EXPORTS LineBreak = BEGIN

nullData: LineBreak.State = [0, 0, 0, 0, FALSE, null, 0, 0];

SoftwareLineBreak: PUBLIC PROCEDURE [arg: LineBreak.ArgHandle]
  RETURNS [reason: LineBreak.Reason] = BEGIN

  props: TokenProps.Handle;
  trailer: TokenProps.TrailerHandle;
  margin: NATURAL = arg.margin;
  thisBreak, pending: LONG POINTER TO TokenProps.BreakPointData ← NIL;
  width: CARDINAL;
  breakNum: CARDINAL ← 0;
  fit: NATURAL ← 0;
  minWidth: NATURAL;
  final, prior: LONG POINTER TO LineBreak.State;
  aM: TokenProps.AtomMetrics;

  WholeWordBreak: PROCEDURE = INLINE {final↑ ← nullData; final.index ← prior.index + 1};
  SaveThisBreak: PROCEDURE = INLINE {final.notPunctuation ← FALSE; prior↑ ← final↑};

  ProcessSpace: PROCEDURE RETURNS [BOOLEAN] = INLINE BEGIN
    final.suffixChar ← space;
    SaveThisBreak[];
    width ← final.micaLength + arg.minSpaceMicaLength;
    IF width > margin THEN RETURN[TRUE];
    final↑ ← [index: final.index,
      micaLength: width,
      pixelLength: final.pixelLength + arg.minSpacePixelLength,
      count: final.count + 1,
      notPunctuation: FALSE,
      suffixChar: space,
      byteCount: final.byteCount + 1,
      whiteSpace: final.whiteSpace + arg.whiteSpace];
    RETURN[FALSE];
    END;

  HyphenateWord: PROCEDURE = INLINE BEGIN
    IF final.notPunctuation THEN GOTO unableToBreakExit;
    IF props.breakPointCount = 0 OR NOT arg.hyphenate THEN GOTO noneFitsExit;
    -- if the last full atom fits in the white space then don't hyphenate
    IF (margin - prior.whiteSpace) <= prior.micaLength THEN
      GOTO noneFitsExit;
    -- pick the hyphenation point with highest desirability that fits in the
    -- white space, otherwise pick the one with the tightest fit
    thisBreak ← @props.breakPoint[0];
    minWidth ← margin - final.whiteSpace;
    DO
      IF thisBreak.type = specialGermanCase THEN GOTO specialGermanExit;
      IF (width ← final.micaLength + thisBreak.micaLength) <= margin THEN
        {IF thisBreak.type = hard THEN
           {final.suffixChar ← hardHyphen; GOTO successExit};
         IF width >= minWidth THEN
           {final.suffixChar ← hyphen; GOTO successExit};
         IF width > fit THEN {fit ← width; pending ← thisBreak}};
      IF (breakNum ← breakNum + 1) = props.breakPointCount THEN
        {IF fit = 0 THEN GOTO noneFitsExit;
         thisBreak ← pending;
         width ← fit;
         final.suffixChar ← hyphen;
         GOTO successExit};
      thisBreak ← thisBreak + SIZE[TokenProps.BreakPointData];
      ENDLOOP;
    EXITS
      successExit =>
        -- add the syllable to the current position and make it the break point
        {final.micaLength ← width;
         final.pixelLength ← final.pixelLength + thisBreak.pixelLength;
         final.byteCount ← final.byteCount + thisBreak.byteCount;
         SaveThisBreak[];
         -- now compute the backup to start the next line
         final↑ ← (IF final.suffixChar = hardHyphen THEN
           [index: final.index,
            micaLength: - thisBreak.micaLength,
            pixelLength: - thisBreak.pixelLength,
            count: 0,
            notPunctuation: FALSE,
            suffixChar: null,
            byteCount: - thisBreak.byteCount,
            whiteSpace: 0] ELSE
           [index: final.index,
            micaLength: - (thisBreak.micaLength - arg.hyphenMicaLength),
            pixelLength: - (thisBreak.pixelLength - arg.hyphenPixelLength),
            count: 0,
            notPunctuation: FALSE,
            suffixChar: null,
            byteCount: - thisBreak.byteCount,
            whiteSpace: 0])};
      noneFitsExit => WholeWordBreak[];
      unableToBreakExit => reason ← unableToBreak;
      specialGermanExit => reason ← specialGermanCase;
    END;

  -- MAIN LOOP OF LineBreak
  final ← @arg.final; prior ← @arg.prior;
  UNTIL final.index >= arg.text.LENGTH DO
    WITH x: arg.text[final.index] SELECT FROM
      token =>
        {aM ← (props ← @arg.propsBase[x.offset]).atomMetrics;
         IF props.font # arg.font THEN GOTO invalidPropsExit;
         IF margin <= (width ← final.micaLength + aM.micaLength) THEN GO TO marginExit;
         IF final.notPunctuation AND props.notPunctuation THEN GO TO contiguousExit;
         final.notPunctuation ← props.notPunctuation;
         final.micaLength ← width;
         final.pixelLength ← final.pixelLength + aM.pixelLength;
         final.byteCount ← final.byteCount + aM.byteCount;
         IF x.spaceFollows THEN IF ProcessSpace[] THEN GO TO simpleMarginExit};
      escape =>
        SELECT x.escapeType FROM
          space => IF ProcessSpace[] THEN GOTO simpleMarginExit;
          zeroWidthSpace => {final.suffixChar ← zeroWidthSpace; SaveThisBreak[]};
          changeBase => GO TO changeBaseExit;
          ENDCASE;
      ENDCASE;
    final.index ← final.index + 1;
    REPEAT
      simpleMarginExit => {reason ← margin; WholeWordBreak[]};
      changeBaseExit => reason ← changeBase;
      invalidPropsExit => reason ← invalidProps;
      marginExit => {reason ← margin; HyphenateWord[]};
      contiguousExit => reason ← contiguousWords;
      FINISHED => reason ← normal;
    ENDLOOP;
  END;
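The essential control flow of the main loop above, with hyphenation, fonts, and escape entries left out, can be sketched in a few lines of C. The fragment is a simplified illustration, not a translation of the Mesa program: `struct token`, `next_line_start`, and the fallback used when a line contains no break opportunity are assumptions. It keeps two properties of the listing: a token fails to fit when the running width reaches the margin, and on overflow the line backs up to the last saved break (the position after the most recent space) rather than splitting a word from the punctuation attached to it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-token data: the precomputed width of the whole token
   and whether a space follows it. */
struct token { int width; int space_follows; };

/* Returns the index of the first token of the next line. */
size_t next_line_start(const struct token *t, size_t n, size_t start,
                       int margin, int space_width)
{
    int width = 0;
    size_t resume = start;   /* where the next line starts on overflow */
    assert(start <= n);
    for (size_t i = start; i < n; i++) {
        width += t[i].width;
        if (width >= margin) {
            if (resume > start)
                return resume;              /* back up to the saved break */
            return (i > start) ? i : i + 1; /* assumed fallback: force progress */
        }
        if (t[i].space_follows) {
            resume = i + 1;                 /* a space is a legal break */
            width += space_width;
            if (width > margin)
                return resume;
        }
    }
    return n;                               /* all remaining tokens fit */
}
```

The loop performs one addition and one comparison per token instead of per character, which is the speed advantage the description attributes to token-at-a-time processing.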
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/* MESA TO C TYPE 8QUZVALEHCES */ 
typedef unsigned short int CARDINAL;
typedef unsigned short int UNSPECIFIED;
typedef unsigned long int  LCARDINAL;
typedef short int          INTEGER;
typedef long int           LINTEGER;
typedef char               CHARACTER;

#define NATURAL(_N)  unsigned int _N : 15
#define BOOLEAN(_N)  unsigned int _N : 1
#define TRUE         1
#define FALSE        0
#define NIL          0
#define NULL         0

/* TYPES USED IN WORDCACHE MESA MODULE */
typedef unsigned char BREAKTYPE;
#define BRK_HARD     0
#define BRK_DISCRET  1
#define BRK_BEST     2
#define BRK_BETTER   3
#define BRK_GOOD     4
#define BRK_OK       5
#define BRK_GERMAN   255

#define OFFSET(_N)   unsigned int _N : 14

typedef struct
{
    CARDINAL micalength;
    CARDINAL pixellength;
    CARDINAL bytecount;
} ATOM_METRICS;

typedef struct
{
    BREAKTYPE     type;
    unsigned char bytecount;
    CARDINAL      micalength;
    CARDINAL      pixellength;
} BREAKPOINT;

typedef struct
{
    BOOLEAN( notpunctuation );
    NATURAL( font );
    ATOM_METRICS atommetrics;
    CARDINAL     breakpointcount;
    BREAKPOINT   breakpoint[1];
} OBJECT;

/* TYPES USED IN LINEBREAK MESA MODULE */
typedef int REASON;
#define REA_MARGIN      0
#define REA_NORMAL      1
#define REA_CHANGEBASE  2
#define REA_INVALCACHE  3
#define REA_CONTWORDS   4
#define REA_CANTBREAK   5
#define REA_GERMAN      6

#define ENTRYTYPE(_N)   unsigned int _N : 1
#define ENT_TOKEN       0
#define ENT_ESCAPE      1

#define ESCAPETYPE(_N)  unsigned int _N : 2
#define ESC_SPACE       0
#define ESC_ZERWIDSPACE 1
#define ESC_CHANGEBASE  2

#define SUFFIXCHAR(_N)  unsigned int _N : 3
#define SUF_SPACE       0
#define SUF_ZERWIDSPACE 1
#define SUF_HYPHEN      2
#define SUF_HARDHYPHEN  3
#define SUF_NULL        4

typedef union
{
    struct
    {
        ENTRYTYPE( entrytype );
        BOOLEAN( spacefollows );
        OFFSET( offset );
    } token;
    struct
    {
        ENTRYTYPE( entrytype );
        ESCAPETYPE( type );
        unsigned int newbase : 13;
    } escape;
} ENTRY;

typedef struct
{
    ENTRY *  base;
    CARDINAL length;
} ENTRYDSC;

typedef struct
{
    CARDINAL index;
    INTEGER  micalength;
    INTEGER  pixellength;
    unsigned int count : 12;
    BOOLEAN( notpunctuation );
    SUFFIXCHAR( suffixchar );
    INTEGER  bytecount;
    CARDINAL whitespace;
} STATE;

typedef struct
{
    ENTRYDSC        text;
    unsigned char * cache;
    BOOLEAN( hyphenate );
    NATURAL( font );
    CARDINAL        margin;
    unsigned char   hyphenpixellength;
    unsigned char   minspacepixellength;
    CARDINAL        hyphenmicalength;
    CARDINAL        minspacemicalength;
    CARDINAL        whitespace;
    STATE           final;
    STATE           prior;
}
ARGREC;

/* LINEBREAKIMPL - THIS ROUTINE IMPLEMENTS LINEBREAKIMPL.MESA IN C */

#define WHOLEWORDBREAK \
    { final->index = prior->index + 1; \
      final->micalength = 0; \
      final->pixellength = 0; \
      final->count = 0; \
      final->notpunctuation = FALSE; \
      final->suffixchar = SUF_NULL; \
      final->bytecount = 0; \
      final->whitespace = 0; }

#define SAVETHISBREAK \
    { final->notpunctuation = FALSE; \
      *prior = *final; }

#define PROCESSSPACE(_B) \
    { final->suffixchar = SUF_SPACE; \
      SAVETHISBREAK; \
      width = final->micalength + arg->minspacemicalength; \
      if( width > margin ) \
          (_B) = TRUE; \
      else \
      { final->micalength = width; \
        final->pixellength += arg->minspacepixellength; \
        final->count += 1; \
        final->notpunctuation = FALSE; \
        final->bytecount += 1; \
        final->whitespace += arg->whitespace; \
        (_B) = FALSE; } }

REASON LINEBREAKIMPL( arg )
register ARGREC * arg;
{
    OBJECT *       props;
    CARDINAL       margin = arg->margin;
    BREAKPOINT *   thisbreakpoint = NULL;
    BREAKPOINT *   pending = NULL;
    CARDINAL       width;
    CARDINAL       breaknum = 0;
    CARDINAL       fit = 0;
    CARDINAL       minwidth;
    STATE *        final = &arg->final;
    STATE *        prior = &arg->prior;
    ATOM_METRICS * am;
    ENTRY *        x = &(arg->text.base[final->index]);
    CARDINAL       bool;

    for( ; final->index < arg->text.length ; final->index++, x++ )
    {
        if( x->token.entrytype == ENT_TOKEN )
        {
            props = (OBJECT *)(arg->cache + x->token.offset);
            am = &props->atommetrics;

            if( props->font != arg->font )
                return( REA_INVALCACHE );

            width = final->micalength + am->micalength;
            if( margin <= width )
            {
                /* HYPHENATE WORD */
                if( final->notpunctuation )
                    return( REA_CANTBREAK );

                if( props->breakpointcount == 0 || !arg->hyphenate )
                {
                    WHOLEWORDBREAK;
                    return( REA_MARGIN );
                }

                /* IF THE LAST FULL WORD FITS IN THE WHITE-
                   SPACE THEN DON'T HYPHENATE */
                if( (margin - prior->whitespace) <= prior->micalength )
                {
                    WHOLEWORDBREAK;
                    return( REA_MARGIN );
                }

                /* PICK THE BREAK POINT WITH THE HIGHEST
                   DESIRABILITY THAT FITS IN THE WHITE SPACE,
                   OTHERWISE PICK THE ONE WITH THE TIGHTEST
                   FIT */
                minwidth = margin - final->whitespace;

                for( thisbreakpoint = &(props->breakpoint[0]) ;; thisbreakpoint++ )
                {
                    if( thisbreakpoint->type == BRK_GERMAN )
                        return( REA_GERMAN );

                    if( (width = final->micalength + thisbreakpoint->micalength) <= margin )
                    {
                        if( thisbreakpoint->type == BRK_HARD )
                        {
                            final->suffixchar = SUF_HARDHYPHEN;
                            break;
                        }

                        if( width >= minwidth )
                        {
                            final->suffixchar = SUF_HYPHEN;
                            break;
                        }

                        if( width > fit )
                        {
                            fit = width;
                            pending = thisbreakpoint;
                        }
                    }

                    if( ++breaknum == props->breakpointcount )
                    {
                        if( fit == 0 )
                        {
                            WHOLEWORDBREAK;
                            return( REA_MARGIN );
                        }

                        thisbreakpoint = pending;
                        width = fit;
                        final->suffixchar = SUF_HYPHEN;
                        break;
                    }
                }

                /* ADD THE BREAKPOINT TO THE CURRENT POSITION AND
                   MAKE IT THE BREAK POINT */
                final->micalength = width;
                final->pixellength += thisbreakpoint->pixellength;
                final->bytecount += thisbreakpoint->bytecount;
                SAVETHISBREAK;

                /* NOW COMPUTE THE BACKUP TO START THE
                   NEXT LINE */
                if( final->suffixchar == SUF_HARDHYPHEN )
                {
                    final->micalength = (-thisbreakpoint->micalength);
                    final->pixellength = (-thisbreakpoint->pixellength);
                }
                else
                {
                    final->micalength =
                        (-(thisbreakpoint->micalength - arg->hyphenmicalength));
                    final->pixellength =
                        (-(thisbreakpoint->pixellength - arg->hyphenpixellength));
                }
                final->count = 0;
                final->notpunctuation = FALSE;
                final->suffixchar = SUF_NULL;
                final->bytecount = (-thisbreakpoint->bytecount);
                final->whitespace = 0;

                return( REA_MARGIN );
            }

            if( final->notpunctuation && props->notpunctuation )
                return( REA_CONTWORDS );

            final->notpunctuation = props->notpunctuation;
            final->micalength = width;
            final->pixellength += am->pixellength;
            final->bytecount += am->bytecount;

            if( x->token.spacefollows )
            {
                PROCESSSPACE( bool );
                if( bool )
                {
                    WHOLEWORDBREAK;
                    return( REA_MARGIN );
                }
            }
        }
        else
        if( x->token.entrytype == ENT_ESCAPE )
        {
            if( x->escape.type == ESC_SPACE )
            {
                PROCESSSPACE( bool );
                if( bool )
                {
                    WHOLEWORDBREAK;
                    return( REA_MARGIN );
                }
            }

            if( x->escape.type == ESC_CHANGEBASE )
                return( REA_CHANGEBASE );

            if( x->escape.type == ESC_ZERWIDSPACE )
            {
                final->suffixchar = SUF_ZERWIDSPACE;
                SAVETHISBREAK;
            }
        }
    }

    return( REA_NORMAL );
}

arg = [text:DESCRIPTOR[11570000B↑, 134] (134) [...],
  propsBase:11421400B↑, hyphenate:TRUE, font:0, margin:8996,
  hyphenPixelLength:3, minSpacePixelLength:2, hyphenMicaLength:118,
  minSpaceMicaLength:128, whiteSpace:128,
  final:[index:0, micaLength:0, pixelLength:0, count:0,
    notPunctuation:FALSE, suffixChar:null,
    byteCount:0, whiteSpace:0],
  prior:[index:0, micaLength:0, pixelLength:0, count:0,
    notPunctuation:FALSE, suffixChar:null, byteCount:0,
    whiteSpace:0]]

arg = [text:DESCRIPTOR[11570000B↑, 134] (134) [...],
  propsBase:11421400B↑, hyphenate:TRUE, font:0, margin:8996,
  hyphenPixelLength:3, minSpacePixelLength:2, hyphenMicaLength:118,
  minSpaceMicaLength:128, whiteSpace:128,
  final:[index:10, micaLength:-745, pixelLength:-2[illegible], count:0,
    notPunctuation:FALSE, suffixChar:null, byteCount:-4,
    whiteSpace:0],
  prior:[index:10, micaLength:8796, pixelLength:233, count:9,
    notPunctuation:FALSE, suffixChar:hyphen, byteCount:51,
    whiteSpace:1152]]

arg = [text:DESCRIPTOR[11570000B↑, 134] (134) [...],
  propsBase:11421400B↑, hyphenate:TRUE, font:0, margin:9878,
  hyphenPixelLength:3, minSpacePixelLength:2, hyphenMicaLength:118,
  minSpaceMicaLength:128, whiteSpace:128,
  final:[index:22, micaLength:0, pixelLength:0, count:0,
    notPunctuation:FALSE, suffixChar:null, byteCount:0,
    whiteSpace:0],
  prior:[index:21, micaLength:9426, pixelLength:249, count:9,
    notPunctuation:FALSE, suffixChar:space, byteCount:54,
    whiteSpace:1152]]

arg = [text:DESCRIPTOR[11570000B↑, 134] (134) [...],
  propsBase:11421400B↑, hyphenate:TRUE, font:0, margin:9878,
  hyphenPixelLength:3, minSpacePixelLength:2, hyphenMicaLength:118,
  minSpaceMicaLength:128, whiteSpace:128,
  final:[index:32, micaLength:0, pixelLength:0, count:0,
    notPunctuation:FALSE, suffixChar:null, byteCount:0,
    whiteSpace:0],
  prior:[index:31, micaLength:9838, pixelLength:265, count:7,
    notPunctuation:FALSE, suffixChar:space, byteCount:57,
    whiteSpace:896]]

[escape[changeBase[newBase: 1]]],
[token [spaceFollows:TRUE,  offset: 1]],      The
[token [spaceFollows:TRUE,  offset: 20]],     approach
[token [spaceFollows:TRUE,  offset: 44]],     used
[token [spaceFollows:TRUE,  offset: 63]],     here
[token [spaceFollows:TRUE,  offset: 82]],     is
[token [spaceFollows:TRUE,  offset: 101]],    to
[token [spaceFollows:TRUE,  offset: 120]],    encode
[token [spaceFollows:TRUE,  offset: 141]],    the
    .
    .
    .
[token [spaceFollows:TRUE,  offset: 230]],    a
[token [spaceFollows:TRUE,  offset: 550]],    space
[token [spaceFollows:TRUE,  offset: 342]],    or
[token [spaceFollows:TRUE,  offset: 465]],    punctuation
[token [spaceFollows:TRUE,  offset: 512]],    mark
[token [spaceFollows:TRUE,  offset: 82]],     is
[token [spaceFollows:TRUE,  offset: 680]],    identified
[token [spaceFollows:TRUE,  offset: 709]],    with
[token [spaceFollows:TRUE,  offset: 230]],    a
[token [spaceFollows:TRUE,  offset: 444]],    unique
[escape[changeBase[newBase: 2]]],
[token [spaceFollows:FALSE, offset: 22]],     token

[escape[changeBase[newBase: 2]]],
[token [spaceFollows:TRUE,  offset: 47]],     [illegible]
[escape[changeBase[newBase: 1]]],
[token [spaceFollows:TRUE,  offset: 768]],    .
[token [spaceFollows:TRUE,  offset: 998]],    Token
[token [spaceFollows:TRUE,  offset: 950]],    properties
[token [spaceFollows:TRUE,  offset: 1024]],   include
[token [spaceFollows:TRUE,  offset: 141]],    the
[token [spaceFollows:TRUE,  offset: 1048]],   last
    .
    .
    .
[token [spaceFollows:TRUE,  offset: 1560]],   will
[escape[space[]]],
[token [spaceFollows:TRUE,  offset: 1579]],   be
[escape[space[]]],
[token [spaceFollows:TRUE,  offset: 629]],    separated
[escape[space[]]],
[token [spaceFollows:TRUE,  offset: 661]],    by
[escape[space[]]],
[token [spaceFollows:TRUE,  offset: 1598]],   two
[escape[space[]]],
[token [spaceFollows:FALSE, offset: 1617]],   spaces
[token [spaceFollows:FALSE, offset: 768]]     .
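In the listing above a token's offset is meaningful only relative to the property table selected by the most recent changeBase escape. The line-break routine handles this by returning to its caller with reason changeBase; the C fragment below sketches the same bookkeeping as a single pass that pairs every token with the base in force when it was encountered. It is an illustration only, and `struct entry`, `struct resolved`, and `resolve_entries` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum kind { K_TOKEN, K_CHANGE_BASE };

/* value is the offset for a token and the new base for a changeBase. */
struct entry    { enum kind kind; unsigned value; };
struct resolved { unsigned base; unsigned offset; };

/* Pair each token with the base selected by the latest changeBase
   escape.  Returns the number of tokens written to out. */
size_t resolve_entries(const struct entry *e, size_t n, unsigned initial_base,
                       struct resolved *out, size_t cap)
{
    unsigned base = initial_base;
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        if (e[i].kind == K_CHANGE_BASE) {
            base = e[i].value;      /* later tokens use the new table */
        } else {
            assert(m < cap);
            out[m].base = base;
            out[m].offset = e[i].value;
            m++;
        }
    }
    return m;
}
```

For the entries at the end of the first block of the listing, the token at offset 444 resolves against base 1 and the token at offset 22 against base 2.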
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